The Data-Centric Principle

What is the most important principle for building large distributed, enterprise level systems?

It is the data-centric principle : to conceive of data as existing separately from application at ever greater levels of granularity in the system. The fundamental premise is that data naturally composes (whereas services and applications do not), and it is the idea of shared data across the system that integrates or ties everything together.

It is described in the fourth bullet point in the wikipedia article on a Database-centric architecture []:

...using a shared database as the basis for communicating between parallel processes in distributed computing applications, as opposed to direct inter-process communication via message passing functions and message-oriented middleware.

In the following diagram we have shared, distributed, managed data across the enterprise, and many different clients running applications that have read or write access to this data (subject to security constraints).

Furthermore it is useful to emphasise a distinction between producers of data versus consumers of data:

Producers And Consumers Access Federated Data

According to the data-centric principle, the shared, federated data by definition must encompass all information that is important to the enterprise.

A federated database [] means a virtual database which is composed from other databases.

Informational requirements comes first

A good system architect is aware of the distinction between physical and logical concerns and will first want to ensure that the logical design of the system will fulfil the informational requirements for the enterprise before getting too concerned with implementation details like the following

Security
Redundancy
Backup
Deployment
Caching
Process boundaries and middleware

These are obviously important, but it must be remembered that these are subordinate to the logical requirements as a system comprised of shared data + applications.

The data-centric view is concerned with identifying the informational requirements of the enterprise, and how the users will access the data according to their roles.

Applications

Clients (i.e. end users of the system) have very specific aims and levels of expertise. They want access to small or large subsets of the federated data in ways that makes sense to their role. This is application specific.

Applications can present the same data in many different ways. New applications can be developed that provide new and powerful ways to browse, edit or present the existing data. For example, to provide expert system capability, or to mine the data to present new forms of data summary. So the data-centric principle emphasises the decoupling of data and application at all scales.

Clients tend to exchange information indirectly through the federated data. There is no need for client applications to directly communicate with each other. Otherwise, the system isn't adhering to the data-centric principle.

Claimed disadvantages of a centralised approach to data management

In Transaction Management in Microservices it is stated that Microservices should not share access to a database:

Instead they say databases should be private to a Microservice:

Microservices guidelines strongly recommend you to use the Single Repository Principle(SRP), which means each microservice maintains its own database and no other service should access the other service’s database directly. There is no direct and simple way of maintaining ACID principles across multiple databases. This is the way the real challenge lies for transaction management in Microservices.

In a data-centric approach, and making use of data replication the following would be favored

Contrast the previous piture with this one from Transaction Management in Microservices

There are a number of claimed disadvantages of a centralised approach to data management

services needs to go to other services for data access so it produces chatty message exchanges
it requires distributed transactions across services
synchronous access to data in other services is slow
downtime of one service causes downtime of another
one data representation must service more needs increasing complexity
DB schema changes impact more systems

Some people claim a centralised database means services need to go to other services for data access so it produces chatty message exchanges. However, in a data-centric approach there are no messages between services in the first place.

Some people claim (paradoxically) that centralised data means data must be updated transactionally across multiple services. However in a data-centric approach an atomic update just means a single ACID transactions on the centralised database, obviously centralised data makes it less likely to need distributed transactions.

It's sometimes claimed synchronous access to data in other services tends to be slow when services share a database. But latency only depends on the DBMS implementation of caching, locking, sharding, RAID arrays, physical locality etc. Perhaps this view arises because of the overheads introduced by the chatty services themselves.

It's sometimes claimed centralised data means downtime of one service causes downtime of others. However, in large enterprise systems centralised databases tend to be easier to manage, in terms of ensuring it is using appropriate hardware, there is proper backup and fail-over support, it is monitored by a DBA etc. Data replication allows for much better robustness than splitting the databases up into lots of smaller databases. Having dozens of small databases, does not make the system more reliable. This is an area where microservices really complicates things.

It's sometimes claimed centralised data means one data representation must service more needs increasing complexity. This arises from an OO mindset, i.e. when OO developers do database management, and yet they're ignorant of the RM. Most OO programmers seem unaware that the natural way to use the RM is to define many simple relations in the DB schema corresponding to many simple predicates. It is generally easy to add more predicates to record more information, it rarely requires changes to existing predicates. The RM is great for dealing with changes to the requirements for what information needs to be recorded. By contrast OO programmers are quickly overwhelmed when there are many predicates because they are trying to map them onto some pointless class hierarchy.

OO programmers are often worried about the impact of changes to a DB schema on the applications or services written on top of them. This isn't problem, it is solved by an important feature of DBMSs called logical independence.

Many to many relationship between data and applications

The shared data in an enterprise will typically be partitioned into separate databases. Most generally there will be a many to many relationship between applications and data:

Services don't compose, data does

It was mentioned above that applications and services don't compose, only data types and databases do.

A relational database is essentially a set of named relations. The composition of two databases can be regarded as the union of these sets. Relational queries can easily be written that involve multiple databases in this way. The implementation decomposes the query into subqueries for submission to the relevant DBMSs, then the result sets of the subqueries are composed into a result set for the overall query. Another interesting kind of composition is available with database schemas, the idea of DVAs (database valued attributes).

Data types naturally compose, using structs, sets, maps, lists, queues, tuples, relations etc.

To a lesser extent types of state machines can compose. A complicated state machine can be implemented in terms of simpler ones. However complicated state machines are difficult to design and implement and should be avoided.

A service is something which has been deployed which is running. It's a singleton in the enterprise. It's not like a type expressed in a programming language which can be instantiated as many times as one likes to implement other types. From that point of view service composition isn't a particularly useful idea - indeed some people promote the idea that services should be as autonomous as possible.

The fact that services don't naturally compose means a service-centric point of view isn't a good basis for achieving reuse. Even though enterprise systems need applications and services, the focus should be on the data. Naturally the scope should be driven by the business requirements, and it's useful to partition an enterprise into autonomous, cohesive, loosely coupled parts at the level of the business activities. This doesn't conflict with a data-centric point of view.

Proper data management

Ideally the management system for the federated data supports all the feaures of a proper DBMS (see the long list of important features in this discussion of the inadequate persistence system anti-pattern).

For a very large enterprise it is unlikely that a single IT company can provide all these capabilities using their own in-house developed technology. Therefore it is expected that data integration will involve disparate database technologies from many vendors. In addition it is necessary to deal somehow with all the single user desktop applications that don't support the necessary data management capabilities needed in a multi-user environment.

The industry is best served with non-proprietary data formats that allow the storage of the same data in competing database technologies. In other words, an open industry standard for data formats. This more than anything allows a client to be vendor neutral and to make use of the latest innovations in database technology.

Web services and SOA

From Wikipedia we have the following characterisations

Web Service []:

A Web Service is defined by the W3C as "a software system designed to support interoperable Machine to Machine interaction over a network." Web services are frequently just Web APIs that can be accessed over a network, such as the Internet, and executed on a remote system hosting the requested services. The W3C Web service definition encompasses many different systems, but in common usage the term refers to clients and servers that communicate using XML messages that follow the SOAP standard. Common in both the field and the terminology is the assumption that there is also a machine readable description of the operations supported by the server written in the Web Services Description Language (WSDL).

SOAP []:

SOAP is a protocol for exchanging XML-based messages over computer networks, normally using HTTP/HTTPS. There are several different types of messaging patterns in SOAP, but by far the most common is the Remote Procedure Call (RPC) pattern, in which one network node (the client) sends a request message to another node (the server) and the server immediately sends a response message to the client.

The World Wide Web (WWW) is data-centric

Why didn't middleware like RPC, CORBA or DCOM take off and become a de facto standard for distributed computing? Why did HTTP on port 80 become vastly more prevalent? I suggest that the reason is that HTTP on port 80 followed the data centric principle!

The Internet (i.e. WWW) also follows a data-centric approach. The federated data can be seen as being recorded over a vast network of web servers whose purpose is to provide access to massive amounts of data (typically in the form of HTML web pages) to clients that can browse this data with just one application: a web browser. This is quite distinct from an application centric approach which requires users to launch many different applications in order to view or edit data.

CEDA is data-centric

CEDA provides interactive collaboration capability, which would seem to be at odds with the data-centric principle, because peers directly connect using TCP/IP in order to communicate without a central server. However, in actual fact a CEDA process straddles both a local copy of some subset of the federated data as well as the client side application, and the message protocol is only for synchronising the shared, replicated data by exchanging small deltas over the wire. So if anything it promotes the whole idea of shared data across the enterprise. This is quite different to applications that communicate with messages that represent commands or actions that need to be performed by the recipient. It is a vital and important distinction, and underlies what the data-centric principle is all about.