CEDA

CEDA is a high performance database technology suitable for the management of all sorts of data, such as

The name CEDA is derived from "DAta CEntric". This is in reference to the Data-Centric Principle

The perfectly smooth panning and zooming of images on the scale of Google Earth on a low performance laptop highlights the exceptional performance characteristics of the database engine.

CEDA can scale from one user to many thousands of users which interactively edit very complex and large amounts of data in applications that feel as responsive as single user applications. The data is replicated and synchronised efficiently, and users are able to view and edit their local copy of the data independently of network latency and network failures. This is achieved using Operational Transformation. The unique and revolutionary algorithms in CEDA typically merge hours of off-line data entry in a fraction of a second.

One of the prime motivations of CEDA is to support robust distributed data management, despite the fact that networks tend to be unreliable, have high latency, low bandwidth etc (see the eight fallacies of distributed computing).

Data is replicated

CEDA allows data to be replicated meaning each site has its own copy of the data stored in its own local database. Operations (updates to the database) are exchanged between sites over the network in order to synchronise the databases.

Replication

Applications only access the local database

Applications that view and edit the data only access the local database. The local access is high bandwidth, low latency and always available. Even though applications are multi-user they feel as responsive as single user applications.

The applications don't directly communicate. Instead the CEDA distributed DBMS does all the work. Applications just access the local database and don't have to worry about the network, latency, replication and synchronisation.

Apps

Operational Transformation

A key feature is the use of Operational Transformation (OT) to allow user edits to be performed on a local database without latency, and changes (called operations) are sent and merged asynchronously without the need for distributed transactions.

OT allows operations to be applied in different orders at different sites. In the following picture site A generates an operation O1 and concurrently site B generates an operation O2. These operations are applied immediately on the database where they are generated.

Two Sites

Each operation is sent to the peer where it is transformed in such a way that it can be applied in a new context and yet preserves the original intention of the operation. So the two sites execute the operations in different orders and yet they reach the same final state.

For more information see the Operational Transformation Wikipedia article.

OT on Text

The CEDA merge algorithms allow for arbitrary merging of changes to text.

Merge On Text

The result is of a merge is always defined, and it is always the union of all the insertions minus the union of all the deletions. The relative order of the inserted characters is always maintained when merging operations.

OT on many variables

The data is represented in a hierarchical manner using sets, tuples, maps, arrays and objects. The operations are typically very fine grained and target the leaf variables which tend to be simple data types. The leaf variables are assumed to be independently updatable.

Operations Target Leaf Vars

Operations are streamed between sites

CEDA synchronises databases without distributed transactions.

Stream Operations

The sender only reads its local database to enqueue operations in a send buffer. The receiver updates its local database with local atomic non-durable transactions. There is no need to synchronise with the sender because machines don't persist information about what operations have been applied on peers - indeed the receiver could crash and roll back to an earlier state, and when it next connects it may receive operations it had previously received before it crashed.

Flushing (durability) on the receiver is not needed for correctness. See the Durability Wikipedia article.

Note that transfers between bank accounts of different banks can be achieved without distributed transactions, by using queued transactions.

Arbitrary network topologies

CEDA allows for many sites (even hundreds of thousands) in arbitrary network topologies.

The following picture shows an operation generated at one site and propagating efficiently through the network, to be delivered to all sites. At the same time other sites can be generating operations as well, so that at any given time there can be many operations propagating in different directions through the network.

Network

Multi-master

Operations on the same data can be generated at different sites regardless of network partitions. In the database theory literature this is called multi-master replication and is known to be highly desirable but very challenging to implement. Indeed there have been hundreds of papers on the subject in the past 40 years. It is like the Holy Grail for data management systems, especially large distributed systems.

It is also called update-anywhere-anytime replication, because a user is always able to update their local copy of the data, even if they are disconnected from other computers. Indeed the network can be very unreliable, dropping out intermittently all the time, and yet users continue working on their local copy, immune to these problems. The editing experience is unaffected by network latency or disconnections. It means multi-user software is just as responsive as single user software.

Multi-master replication is known to be difficult because of some negative results which have been established in the literature, such as the CAP theorem. CEDA addresses the limitations of the CAP theorem by allowing sites to temporarily diverge as operations are performed in different orders. This is sometimes called eventual consistency.

Once all sites have received all operations they necessarily converge to the same state. CEDA does not compromise on availability and partition tolerance (in contrast to systems which don't are therefore are fragile). When there is a network failure users are able to continue updating their local copies of the data, they are autonomous. The algorithms are very robust, and allow for redundant data paths, failed connections, changes to the network topology and numerous other failure conditions.

In fact CEDA is well suited to replication in extremely unreliable networks. It even allows connections to be broken every few seconds and yet allows robust synchronisation of replicated data. This has been proven to work with reconnections in arbitrary network topologies that change over time. Computers can even connect that have never directly connected before in the past and exchange operations that were received from other peers. The CEDA replication system was first implemented 8 years ago and has been subjected to ongoing, heavy testing with billions of randomly generated operations on randomly generated network topologies with randomly generated reconnections.

Another negative result in the literature is a paper showing that under quite general assumptions replication is not scalable because the computational load increases 1000 fold when the system is 10 times larger. This can easily mean a system cannot keep up with a very modest transaction rate, much to the surprise of its developers. Such a situation is unrecoverable because the load increases as the divergence in the copies of the database increases.

As a result many systems tend to only use master-slave replication. This means updates can only be applied to one computer (the master) and updates only propagate in one direction to all the slaves. This is quite limiting compared to update-anywhere-anytime replication. E.g. users cannot work if they cannot connect to the master and the data entry application may seem sluggish because of network latency (i.e. the time for messages to be sent to and from the master over the network).

Nevertheless CEDA has a computational load which is proportional to the size of the system, possible because it avoids the assumptions in the literature than imply replication cannot scale. In fact the algorithms are extraordinarily efficient.

Google have tried to support update-anywhere-anytime with Google Wave, a project that caught the interest of industry experts for its exciting proposal to use Operational Transformation to achieve multi-master replication, but their solution doesn't satisfy a mathematical property in the literature called TP2, which means it is not able to merge changes in arbitrary orders for arbitrary network topologies.

CEDA was compared to ObjectStore (a popular object oriented DBMS) in 2004 and CEDA was found to achieve 100 times the transaction throughput in a multi-user database system on a LAN. The benefits of CEDA would have been even greater on a WAN. This is essentially because CEDA uses multi-master replication with fully asynchronous operations, whereas Object Store uses distributed transactions, multi-phase commit and pessimistic locking. ObjectStore is using the conventional approach still emphasised in the database literature today, but which exhibits both poor performance and can't be made robust to network partitions because of the theoretical impossibility of guaranteeing all sites agree to commit a distributed transaction or not when the network is unreliable.

Keys features include:

The lattice of database states

The database states that preserve causality form a mathematical structure called a lattice.

Lattice Of States

Jigsaw example

Position of connected jigsaw pieces

The Jigsaw derived representation provides an example of how OT can be used on an unconstrained base representation and yet a derived representation imposes constraints (in this case on the relative positions of the jigsaw pieces).

Primer on Operational Transform

This primer on Operational Transform may be of interest to those wanting to know more about the underlying theory.

Configuration Management

OT is well suited to both real-time interactive collaboration and configuration management. The latter involves working in isolation and manually controlling when to check-in, check-out, create branches and merge branches.

Configuration Management