The Data-Centric Principle

What is the most important principle for building large, distributed, enterprise-level systems?

It is the data-centric principle: to conceive of data as existing separately from applications at ever greater levels of granularity in the system. The fundamental premise is that data naturally composes (whereas applications do not), and it is the idea of shared data across the system that integrates or ties everything together.

It is described in the fourth bullet point in the Database-centric architecture Wikipedia article:

...using a shared database as the basis for communicating between parallel processes in distributed computing applications, as opposed to direct inter-process communication via message passing functions and message-oriented middleware.

In the following diagram we have shared, distributed, managed data across the enterprise, and many different clients running applications that have read or write access to this data (subject to security constraints).

Clients Access Federated Data

Furthermore, it is useful to emphasise the distinction between producers of data and consumers of data:

Producers And Consumers Access Federated Data

According to the data-centric principle, the shared, federated data by definition must encompass all information that is important to the enterprise.

A federated database is a virtual database composed from other databases. See the Federated database system Wikipedia article.
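As a rough illustration of the idea, the following sketch uses Python's standard sqlite3 module to present two separate databases as one logical table. The database names (sales, support), the table layout and the helper federated_customers are assumptions made up for this example; a real federated system would span different servers and vendors.

```python
import sqlite3

# One connection stands in for the federated (virtual) database.
conn = sqlite3.connect(":memory:")

# Attach two independent member databases (hypothetical names).
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS support")

conn.execute("CREATE TABLE sales.customers (name TEXT, region TEXT)")
conn.execute("CREATE TABLE support.customers (name TEXT, region TEXT)")
conn.execute("INSERT INTO sales.customers VALUES ('Acme', 'EU')")
conn.execute("INSERT INTO support.customers VALUES ('Globex', 'US')")

# A temporary view presents both members as a single logical table.
conn.execute("""
    CREATE TEMP VIEW all_customers AS
    SELECT name, region, 'sales' AS source FROM sales.customers
    UNION ALL
    SELECT name, region, 'support' AS source FROM support.customers
""")

def federated_customers():
    """Return every customer visible through the federated view."""
    return conn.execute(
        "SELECT name, region, source FROM all_customers ORDER BY name"
    ).fetchall()
```

A client that queries all_customers does not need to know which member database holds a given row, which is the sense in which the federation is virtual.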

Informational requirements come first

A good system architect is aware of the distinction between physical and logical concerns and will first want to ensure that the logical design of the system will fulfil the informational requirements for the enterprise before getting too concerned with implementation details like the following

These are obviously important, but it must be remembered that they are subordinate to the logical requirements of a system composed of shared data and applications.

The data-centric view is concerned with identifying the informational requirements of the enterprise, and how the users will access the data according to their roles.

Applications

Clients (i.e. end users of the system) have very specific aims and levels of expertise. They want access to small or large subsets of the federated data in ways that make sense for their role. This is application specific.

Applications can present the same data in many different ways. New applications can be developed that provide new and powerful ways to browse, edit or present the existing data. For example, a new application might provide expert system capability, or mine the data to present new forms of data summary. So the data-centric principle emphasises the decoupling of data and application at all scales.

Clients exchange information indirectly through the federated data. There is no need for client applications to communicate directly with each other; where they do, the system isn't adhering to the data-centric principle.

No Client Communication
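A minimal sketch of this indirect exchange, assuming a hypothetical readings table held in an in-memory SQLite database: the producer and the consumer never call each other, and each knows only the shared data. The function names and schema are invented for illustration.

```python
import sqlite3

# The shared data store (hypothetical schema); in a real deployment this
# would be a managed, multi-user database rather than an in-memory one.
shared = sqlite3.connect(":memory:")
shared.execute("CREATE TABLE readings (sensor TEXT, value REAL)")

def producer_record(sensor, value):
    """A producer application: it only writes facts to the shared data."""
    with shared:  # commit the insert as one transaction
        shared.execute("INSERT INTO readings VALUES (?, ?)", (sensor, value))

def consumer_average(sensor):
    """A consumer application: it only reads from the shared data."""
    row = shared.execute(
        "SELECT AVG(value) FROM readings WHERE sensor = ?", (sensor,)
    ).fetchone()
    return row[0]

producer_record("boiler", 70.0)
producer_record("boiler", 74.0)
average = consumer_average("boiler")
```

Either application could be replaced, or a second consumer added, without the other needing to change, because the only contract between them is the data itself.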

Many-to-many relationship between data and applications

The shared data in an enterprise will typically be partitioned into separate databases. In the most general case there will be a many-to-many relationship between applications and data:

Many to Many Apps And Data

Data management

Ideally the management system for the federated data supports all the following

Clearly these are the kind of features offered by a Database Management System (DBMS). These are complex and difficult features to implement that require significant expertise from very experienced systems programmers. Therefore it is preferable for these features to be provided uniformly across all types of data in the enterprise to amortize their significant development cost. Clearly we cannot expect desktop applications to develop their own proprietary DBMS, and we see that in practice point solutions tend to have very basic support for data management (typically data is simply stored in files). Hopefully there will be a trend for these applications to work on top of a proper DBMS.

For a very large enterprise it is unlikely that a single IT company can provide all these capabilities using its own in-house technology. Therefore it is expected that data integration will involve disparate database technologies from many vendors. In addition, it is necessary to deal somehow with all the single-user desktop applications that lack the data management capabilities needed in a multi-user environment.

The industry is best served with non-proprietary data formats that allow the storage of the same data in competing database technologies. In other words, an open industry standard for data formats. This more than anything allows a client to be vendor neutral and to make use of the latest innovations in database technology.

Middleware

There have been a number of different middleware technologies over the years, such as

All of these concern Inter-Process Communication (IPC) in the form of messages or commands that are sent between computers. Messages can encompass the transmission of data as well as the transmission of commands. However, most of the time programmers associate middleware with the transmission of synchronous commands: Computer 1 says "do this ...", and Computer 2 responds with "ok, done that, what next?". In other words, the messages are like verbs, because they are associated with causing actions to take place.

Why didn't RPC, CORBA or DCOM take off and become a de facto standard for distributed computing? Why did HTTP on port 80 become vastly more prevalent? I suggest that the reason is that HTTP on port 80 followed the data-centric principle!

Web services and SOA

From Wikipedia we have the following characterisations

Web Service: [http://en.wikipedia.org/wiki/Web_Services]

A Web Service is defined by the W3C as "a software system designed to support interoperable Machine to Machine interaction over a network." Web services are frequently just Web APIs that can be accessed over a network, such as the Internet, and executed on a remote system hosting the requested services. The W3C Web service definition encompasses many different systems, but in common usage the term refers to clients and servers that communicate using XML messages that follow the SOAP standard. Common in both the field and the terminology is the assumption that there is also a machine readable description of the operations supported by the server written in the Web Services Description Language (WSDL).

SOAP: [http://en.wikipedia.org/wiki/SOAP]

SOAP is a protocol for exchanging XML-based messages over computer networks, normally using HTTP/HTTPS. There are several different types of messaging patterns in SOAP, but by far the most common is the Remote Procedure Call (RPC) pattern, in which one network node (the client) sends a request message to another node (the server) and the server immediately sends a response message to the client.

SOA: [http://en.wikipedia.org/wiki/Service-oriented_architecture]

Service Oriented Architecture (SOA) is a computer systems architectural style for creating and using business processes, packaged as services, throughout their lifecycle. SOA also defines and provisions the IT infrastructure to allow different applications to exchange data and participate in business processes. SOA separates functions into distinct units (services), which can be distributed over a network and can be combined and reused to create business applications. These services communicate with each other by passing data from one service to another, or by coordinating an activity between two or more services. Architecture is not tied to a specific technology. It may be implemented using a wide range of technologies, including SOAP, RPC, DCOM, CORBA, Web Services or WCF. SOA can be implemented using one or more of these protocols and, for example, might use a file system mechanism to communicate data conforming to a defined interface specification between processes conforming to the SOA concept. The key is independent services with defined interfaces that can be called to perform their tasks in a standard way, without the service having foreknowledge of the calling application, and without the application having or needing knowledge of how the service actually performs its tasks.

Evidently the emphasis is on decomposing a large system into smaller pieces that exhibit behaviour and interoperate by exchanging messages, predominantly synchronous messages. The references to "coordinating an activity" and the word "service" make it clear that this is an OO analogy extended to the enterprise level. Furthermore, the claim of independence (from "how the service actually performs its tasks") is analogous to the OO concept of encapsulation.

This point of view is in conflict with the data-centric principle. In fact it would appear that when people think about SOA / Web services / SOAP they tend to have something like the following picture in mind:

SOA

Anyone familiar with OO will immediately see the close parallel. Objects encapsulate state and only interact by messages that invoke behaviours.

Worse still they will be thinking in terms of

Given that applications do not compose (only data can logically compose), this approach is far less effective for data integration than the data-centric principle.

The World Wide Web (WWW) is data-centric

The World Wide Web (WWW) also follows a data-centric approach. The federated data can be seen as being recorded over a vast network of web servers whose purpose is to provide access to massive amounts of data (typically in the form of HTML web pages) to clients that can browse this data with just one application: a web browser. This is quite distinct from an application-centric approach, which requires users to launch many different applications in order to view or edit data.

WWW is data-centric

CEDA is data-centric

CEDA provides interactive collaboration capability, which would seem to be at odds with the data-centric principle, because peers connect directly using TCP/IP in order to communicate without a central server. However, a CEDA process straddles both a local copy of some subset of the federated data and the client-side application, and the message protocol is only for synchronising the shared, replicated data by exchanging small deltas over the wire. So if anything it promotes the whole idea of shared data across the enterprise. This is quite different to applications that communicate with messages representing commands or actions to be performed by the recipient. It is a vital distinction, and underlies what the data-centric principle is all about.

CEDA No Client Communication
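The distinction between deltas and commands can be sketched in a few lines. This is not CEDA's API or its merge algorithm, neither of which is described here; the Delta and Replica classes and the synchronise function are hypothetical, and the sketch assumes the two replicas never edit the same key, so it sidesteps conflict resolution entirely.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Delta:
    """A small change to the shared data: set one key to one value."""
    key: str
    value: str

@dataclass
class Replica:
    """A process holding a local copy of some shared data (hypothetical)."""
    data: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)

    def local_edit(self, key, value):
        """Apply an edit locally and queue the delta for peers."""
        delta = Delta(key, value)
        self.apply(delta)
        self.outbox.append(delta)

    def apply(self, delta):
        """Apply a delta; it describes data, not an action to perform."""
        self.data[delta.key] = delta.value

def synchronise(sender, receiver):
    """Ship the sender's pending deltas to the receiver, then clear them."""
    for delta in sender.outbox:
        receiver.apply(delta)
    sender.outbox.clear()

alice, bob = Replica(), Replica()
alice.local_edit("title", "Q3 plan")
bob.local_edit("owner", "Bob")
synchronise(alice, bob)
synchronise(bob, alice)
```

After both exchanges the two replicas hold the same data. The messages on the wire say what the data now is, never what the recipient should do, which is the property the paragraph above is pointing at.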