A distributed transaction [] is a database transaction involving multiple participating databases at multiple physical locations.
Distributed transactions are usually required to have all four ACID (atomicity, consistency, isolation, durability) properties. In particular atomicity means either all the local participating databases are updated, or none of them are.
In the book Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery by Gerhard Weikum and Gottfried Vossen it is shown (theorem 19.4): There exists no distributed commit protocol that can guarantee independent process recovery in the presence of multiple failures (e.g., network partitionings).
An even more discouraging result is the FLP impossibility theorem - which appeared in the paper Impossibility of distributed consensus with one faulty process by Michael Fischer, Nancy Lynch and Michael Paterson in 1985. It shows that no completely asynchronous consensus protocol can tolerate even a single process that fail stops, even assuming the message system delivers all messages correctly and exactly once. The FLP impossibility result extends to distributed transactions (because whether to commit or abort requires consensus for all the participating databases). This effectively means that all distributed commit protocols have a window of time in which the delay or inaccessibility of a single process can cause the entire algorithm to wait.
A common algorithm for implementing distributed transactions is the two-phase commit (2PC) [] protocol. Unfortunately 2PC is a blocking protocol, meaning that failure of one host can prevent other hosts from completing the transaction. This happens when every worker has voted yes, but the coordinator fails before sending a message with the final decision to at least one worker. This can mean a transaction holds some resource locks that cannot be released because the transaction never completes. That prevents other transactions from locking those resources. The effect is that a participating database can become unavailable, even though it didn't itself fail and there are no network partitions.
Manual intervention is needed when a failure makes the whole system grind to a halt because nodes have locked resources and they can neither commit or rollback a distributed transaction.
2PC is expensive in terms of the number of messages and forced log writes. 2PC with one coordinator and n cohorts involves 4n messages, 2n+2 forced log writes and 1 unforced log write.
The three-phase commit (3PC) [] protocol is non-blocking, placing an upper bound on the time for a transaction to commit or abort. The original paper Nonblocking commit protocols by Dale Skeen appeared in 1981. Unfortunatly 3PC is not partition tolerant - it cannot recover in the event the network is partitioned in any manner. It also doesn't work with asynchronous communication.
The paper Increasing the Resilience of Distributed and Replicated Database Systems by Keidar and Dolev in 1998, presents an atomic commitment protocol called enhanced three phase commit (E3PC) that always allows a quorum in the system to make progress. It is non-blocking and partition tolerant. Unfortunately like 3PC, it requires at least three round trips to complete, needing a minimum of three round trip times (RTTs).
In 1989 Leslie Lamport developed Paxos, though it wasn't published until 1998. See The Part-Time Parliament. Lamport tried to make the explanation simpler, in the paper Paxos Made Simple in 2001. The desciption presented by Jeff Chase in his Distributed Systems, Failures, and Consensus slides is helpful for those wanting to properly understand Paxos.
The basic steps in Paxos are similar to 2PC:
The key difference from 2PC is that unlike 2PC where all nodes need to agree, only a majority needs to agree.
Raft means Reliable,Replicated,Redundant, And Fault-Tolerant. It's a consensus algorithm designed as an alternative to Paxos, intended to be more understandable than Paxos.
The FLP impossibility theorem can be regarded as a trilemma [] for asynchonous protocols. You have to pick two out of three of the following:
2PC, Paxos and Raft are safe but not live. 3PC is live but not safe.
Distributed transactions tend to be slow because they always involve one or more network round trips (RTTs). Furthermore, they make the database systems using them perform poorly (affecting latency, throughput, and scalability) because resources are locked for one or more RTTs. Also they require Distributed Deadlock Detection
It's easy to find negative comments about them on the Internet:
It comes as a shock to many, but distributed transactions are the bane of high performance and high availability
...Yes they're evil. But are they a necessary evil - that's the real question. ...I believe they are indeed the worst sort: necessary. We're all dooooooomed......
Daniel Abadi says it's time to move on from two phase commit in his blog.
Richard Henderson (from Distributed Transactions Are Evil):
My conclusion is that distributed objects, created implicitly by distributed transactions, are so evil that they should be avoided if possible. If you don't then you have created a difficult management problem that I don't believe should be ignored.
...a number of companies haven't a clue how much danger they are in. They trust to 2-phase transactions and independent backups. They spend huge sums of money on HA paired servers to try and make the physical system failure-proof. At the end of the day, however, it is impossible. Nature finds a way to make your life miserable, and all because of these horrid distributed transactions
Nevertheless there's still ongoing research and projects involving distributed transactions, for example in 2019 an open source project called Fescar was launched to support distributed transactions.
The purported advantage of distributed transactions is that it makes things easier for application developers because it pushes the complexity down into a properly designed and tested subsystem.
What's an example where distributed transactions makes things simpler?
From the point of view of consistency the hard part of distributed business software applications is supporting contingent sequences of transfers between accounts holding money or stock, typically avoiding the potential for negative balances (overdrawn accounts).
But this is easily solved without distributed transactions. Avoiding distributed transactions allows for a much simpler, more robust and efficient solution.
It's common for authors to simply assume the distributed consensus problem is at the heart of distributed computing. For example, in Distributed Consensus Reloaded: Apache ZooKeeper and Replication in Apache Kafka by Flavio Junqueira, it is stated:
Many distributed systems that we build and use currently rely on dependencies like Apache ZooKeeper, Consul, etcd, or even a homebrewed version based on Raft. Although these systems vary on the features they expose, the core is replicated and solves a fundamental problem that virtually any distributed system must solve: agreement. Processes in a distributed system need to agree on a master, on the members of a group, on configuration, on the owner of a lock, and on when to cross a barrier. These are all problems commonly found in the design of distributed systems, and the approach of adopting one of these dependencies has been successful because these systems fundamentally solve the distributed consensus problem
The wikipedia article on consensus [] suggests consensus has fundamental importance:
A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires processes to agree on some data value that is needed during computation. Examples of applications of consensus include whether to commit a transaction to a database, agreeing on the identity of a leader, state machine replication, and atomic broadcasts.
But it looks likely that business software doesn't need a solution to the distributed consensus problem!
Neither does data replication. For example, CEDA database replication is fully asynchronous and uses Operational Transformation, and allows arbitrary fail-stops and network partitions, and any node that is accessible is available (for both reads and writes, without restriction), and any pair of nodes that can exchange messages can synchronise, regardless of whether other nodes have stopped.