Long running transactions (sagas)

A sequence of contingent business activities might take hours, days, weeks or even longer to complete. Such a sequence sometimes corresponds to what the software industry calls a long running transaction, or saga.

Distributed transactions aren't suitable for long running contingent business activities: it is not reasonable to lock resources like bank accounts for weeks.

The contingent nature of business activities must be properly modelled by business software, and that means there's a need for compensating transactions. This is inevitable. For example, a buyer may get a refund when a received good is defective, or a buyer may have the right to cancel a contract within a cooling-off period.

Consider the case of a buyer who purchases goods from a seller. The process runs from when the buyer first raises a purchase order (PO) through to when they receive the goods.

The buyer may check that the goods are currently in stock. This is only suggestive information: there is no guarantee that they will still be in stock when the order is placed.

The following shows how some contingent business activities can be modelled:

  1. The buyer raises a PO. This may contain conditions, for example that it is subject to receipt of the goods in a certain way or by a certain time. The issue of a PO does not itself form a contract. However, acceptance of the order by the seller forms a contract between the buyer and seller. Therefore, at the time the buyer raises a PO they need to ensure they will have enough money in their bank account to cover the purchase. The distributed transaction (aka 2PC) way to ensure this is to lock their bank account in the "prepare" phase! This is obviously unreasonable. Instead, it is better to actually transfer the money into a "holding account" so that it is reserved. This corresponds to our approach of pushing a deposit record onto a persistent queue. More specifically, the creation of a PO is modelled by a local transaction on the buyer's database which atomically:
    • deducts the money from the buyer's bank account if that account is in the buyer's database, or otherwise increments a sequence number which represents the processing of money sent by another database
    • pushes a PO onto a local persistent queue. This represents a conditional deposit record (i.e. a promise to send money which is conditional on acceptance by the seller).
  2. The seller is sent the PO over a transient message queue. The seller checks whether they are willing to commit to the contract. The PO is processed in a local transaction on the seller's database, which atomically:
    • increments a local sequence number for the number of POs from the buyer which have been processed
    • if the seller accepts the order then it
      • pushes an invoice record onto a local persistent queue
      • decrements a stock account in its inventory
      • pushes a delivery record onto a local persistent queue. This represents a commitment to ship the product
    • otherwise if the seller rejects the order then it pushes a PO rejection record onto a local persistent queue.
  3. If the buyer processes a PO rejection record then they reverse the payment.
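
The three steps above can be sketched with two independent SQLite databases, one per party. This is only an illustrative sketch under stated assumptions, not a definitive implementation: the schema (account, queue, processed), the function names, the record kinds ('PO', 'INVOICE', 'DELIVERY', 'PO_REJECTED') and the acceptance rule "accept if enough stock" are all hypothetical. The transient messaging between the two parties is stood in for by plain function calls.

```python
import sqlite3

# Hypothetical minimal schema: named accounts (money or stock), an append-only
# persistent queue, and a per-peer sequence number of records processed so far.
SCHEMA = """
CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE queue (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                    kind TEXT NOT NULL, po_id TEXT NOT NULL, amount INTEGER NOT NULL);
CREATE TABLE processed (peer TEXT PRIMARY KEY, n_processed INTEGER NOT NULL);
"""

def make_db(balances, peer):
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    with db:
        db.executemany("INSERT INTO account VALUES (?, ?)", list(balances.items()))
        db.execute("INSERT INTO processed VALUES (?, 0)", (peer,))
    return db

def balance(db, name):
    return db.execute("SELECT balance FROM account WHERE name = ?", (name,)).fetchone()[0]

def push(db, kind, po_id, amount):
    db.execute("INSERT INTO queue (kind, po_id, amount) VALUES (?, ?, ?)", (kind, po_id, amount))

def bump(db, peer):
    db.execute("UPDATE processed SET n_processed = n_processed + 1 WHERE peer = ?", (peer,))

def raise_po(buyer_db, po_id, price):
    """Step 1: deduct the money and push the PO in one local transaction."""
    with buyer_db:  # commits on success, rolls back if an exception escapes
        if balance(buyer_db, "bank") < price:
            raise ValueError("insufficient funds")
        buyer_db.execute("UPDATE account SET balance = balance - ? WHERE name = 'bank'", (price,))
        push(buyer_db, "PO", po_id, price)  # the conditional deposit record

def process_po(seller_db, po_id, price, item, qty):
    """Step 2: count the PO as processed, then accept or reject it atomically."""
    with seller_db:
        bump(seller_db, "buyer")
        if balance(seller_db, item) >= qty:  # assumed acceptance rule
            push(seller_db, "INVOICE", po_id, price)
            seller_db.execute("UPDATE account SET balance = balance - ? WHERE name = ?", (qty, item))
            push(seller_db, "DELIVERY", po_id, qty)  # commitment to ship
            return True
        push(seller_db, "PO_REJECTED", po_id, price)
        return False

def process_rejection(buyer_db, po_id, price):
    """Step 3: the compensating transaction that reverses the payment."""
    with buyer_db:
        bump(buyer_db, "seller")
        buyer_db.execute("UPDATE account SET balance = balance + ? WHERE name = 'bank'", (price,))

buyer = make_db({"bank": 100}, peer="seller")
seller = make_db({"widget": 1}, peer="buyer")

raise_po(buyer, "PO-1", 60)
process_po(seller, "PO-1", 60, "widget", 1)          # accepted: one widget in stock
raise_po(buyer, "PO-2", 30)
if not process_po(seller, "PO-2", 30, "widget", 1):  # rejected: stock is now 0
    process_rejection(buyer, "PO-2", 30)
print(balance(buyer, "bank"), balance(seller, "widget"))  # prints "40 0"
```

Each function body is a single local transaction, so a crash part-way through leaves neither a deduction without its queued record nor a queued record without its deduction. No lock is ever held across the two databases.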

This is a very direct and straightforward model of the contingent business activities, and it uses double entry book-keeping throughout.

It's all about the data in the databases and avoiding notions of distributed consistency constraints. Who owns what money/stock is recorded in databases and only in databases.

It has nothing to do with the state of the transient messaging system, such as what messages are queued to be sent, or are in flight or have been received. The messages are transient and therefore play second fiddle to the data-centric point of view.

This is a different point of view to most of the software industry, which is fixated on "service oriented" and "message oriented" architectures - i.e. state machines and messages, when the emphasis should really be on modelling the state of contingent business activities with pure data in databases.

More specifically, the software industry is fixated on hideous approaches involving anti-patterns like 2PC, at-least-once delivery, persistent messages, duplicate message testing, idempotent message handling, retry loops, message replay, timeouts, dead letter queues, etc.