Extend SAGA to the cloud for distributed transactions

Posted by André on 2024-02-27.

In 2017, I wrote an article on managing transactions in the cloud [1] that described common patterns to achieve consistency across multiple components and microservices. The world has moved on, and I have since worked on an IBM Academy of Technology study on how to achieve transactional semantics with microservices. I want to report on that study here.

Note: this article was originally published on IBM's architecture website at the end of 2020 and is republished here within IBM's publishing guidelines.

Transactions so far

Transactional consistency is an important property of a system, whether that system is a financial transaction system, a web-based shopping application, or a travel reservation system. There are several common patterns for achieving transactional consistency across components. In these patterns, you can see two main approaches:

Use a classical programming model and let a transactional infrastructure manage everything that is "behind the scenes". This approach helps developers to concentrate on the business operation.
Separate the business operation into two or three suboperations with an intermediary state in between. This approach requires that the developers know how to separate a business operation to first create and then resolve - in either direction - an intermediary state.

One example of the first approach is the classical two-phase-commit transaction. However, in the cloud or across microservices, two-phase-commit has certain drawbacks. The main example of the second approach is the SAGA.

Shortcomings of traditional solutions

SAGAs are a standard solution to achieve consistency across microservices. SAGAs are driven either by a central orchestrator that handles the workflow of the operations and the transaction state, or by events (event-driven SAGAs). One shortcoming of event-driven SAGAs is the lack of a central infrastructure.

Solutions like SAGAs that have a "know-it-all" orchestrator can present scalability issues. Choreography approaches that use messaging can be hard to manage. In two-phase commit solutions, heuristic situations can appear: a participant assumes a specific outcome of the transaction when a timeout occurs. This assumption might be safe in a controlled environment, but in many situations, it can result in inconsistencies that are difficult to resolve.

A new way to implement SAGAs

A new protocol, Transactions for Microservices (TXMS), is based on SAGA and uses this set of assumptions:

TXMS assumes that an operation in a microservice can be compensated. This assumption is generalized further so that the protocol has a forward business operation and a compensate operation. The protocol also allows for a complete operation.
The protocol is eventually consistent, cloud native, and polyglot. A simple RESTful API is defined that can be used with any programming language that supports that format.
Synchronous, asynchronous and mixed operations must be supported along with short- and long-running transactions.
TXMS uses the ubiquitous pattern of idempotency and retries as a means to recover from error situations.
It avoids heuristic situations by defining a single source of truth, in the form of a cloud native transaction manager service.
It supports discoverability so that active and older transactions, their status, and their participants can be discovered.

The TXMS protocol

There are several components of a transaction:

A source for a unique transaction ID
Participants: where the business logic is implemented
A single source of truth for each transaction; that is, the information about whether the transaction is still ongoing, is to be rolled back, or is to be completed
An orchestrator

In an orchestrated SAGA, the initiator and finalizer are usually part of the orchestrator (or rather, the orchestrator plays the role of the initiator and finalizer). With TXMS, the orchestrator is optional; the task of initiating and closing a transaction can be done by separate components, such as a BFF (backend-for-frontend) service, or by separate microservices that fulfil the role of initiator and finalizer in the context of the transaction. TXMS borrows a concept from the Resource Recovery Service (RRS), which is a component of IBM(R) z/OS(R). Applied to the cloud, this concept turns the transaction control around: a transaction manager runs as a service where a participant can register to get the outcome of the transaction.

The transaction manager can accomplish these tasks:

Create a unique transaction ID, which is basically a REST URI for the transaction
Register participants for that transaction ID
Keep the current state of the transaction
Notify participants of the outcome of the transaction
Allow REST API discovery of transaction states and participants

A participant then registers a callback with the transaction manager and passes the transaction ID downstream to any other called service.

The TXMS protocol separates the business action of a participant into a forward operation for the business part, and a compensate and a complete operation to finish the transaction. The latter two are called by the transaction manager.
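To make the division of labour more concrete, here is a minimal in-memory sketch of such a transaction manager in Java. It is not the TXMS implementation: the class name TxManagerSketch, the Participant interface, the close method, and the "/transactions/..." ID format are all assumptions made for illustration, and a real service would expose these operations over REST and persist its state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory sketch of a TXMS-style transaction manager. All names are hypothetical. */
final class TxManagerSketch {

    enum State { ACTIVE, COMPLETED, COMPENSATED }

    /** Callbacks that the transaction manager invokes once the outcome is known. */
    interface Participant {
        void complete(String txId);
        void compensate(String txId);
    }

    private static final class Tx {
        State state = State.ACTIVE;
        final List<Participant> participants = new ArrayList<>();
    }

    private final Map<String, Tx> transactions = new ConcurrentHashMap<>();

    /** Creates a transaction; the ID is shaped like a REST URI (assumed format). */
    String begin() {
        String txId = "/transactions/" + UUID.randomUUID();
        transactions.put(txId, new Tx());
        return txId;
    }

    /** Registers a participant callback for an ongoing transaction. */
    void register(String txId, Participant participant) {
        Tx tx = lookup(txId);
        synchronized (tx) {
            if (tx.state != State.ACTIVE) {
                throw new IllegalStateException("transaction already closed: " + txId);
            }
            tx.participants.add(participant);
        }
    }

    /** Records the outcome (the first one wins) and notifies all participants. */
    void close(String txId, boolean success) {
        Tx tx = lookup(txId);
        State outcome;
        List<Participant> toNotify;
        synchronized (tx) {
            if (tx.state == State.ACTIVE) {
                tx.state = success ? State.COMPLETED : State.COMPENSATED;
            }
            outcome = tx.state;
            toNotify = new ArrayList<>(tx.participants);
        }
        for (Participant p : toNotify) {
            if (outcome == State.COMPLETED) {
                p.complete(txId);
            } else {
                p.compensate(txId);
            }
        }
    }

    /** Discovery: current state of a transaction. */
    State stateOf(String txId) {
        Tx tx = lookup(txId);
        synchronized (tx) {
            return tx.state;
        }
    }

    /** Discovery: number of registered participants. */
    int participantCount(String txId) {
        Tx tx = lookup(txId);
        synchronized (tx) {
            return tx.participants.size();
        }
    }

    private Tx lookup(String txId) {
        Tx tx = transactions.get(txId);
        if (tx == null) {
            throw new IllegalArgumentException("unknown transaction: " + txId);
        }
        return tx;
    }

    public static void main(String[] args) {
        TxManagerSketch manager = new TxManagerSketch();
        String txId = manager.begin();
        manager.register(txId, new Participant() {
            @Override public void complete(String id) { System.out.println("complete " + id); }
            @Override public void compensate(String id) { System.out.println("compensate " + id); }
        });
        manager.close(txId, true);
        System.out.println(manager.stateOf(txId));
    }
}
```

In this sketch, calling close a second time does not change the recorded outcome but notifies the participants again, which is why the callbacks are expected to be idempotent.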

The following diagram shows a classical operation that uses multiple participants and includes an orchestrator that initializes the transaction, calls two of the three participants, and completes the transaction. Arrows show the control flow and each arrow has an implicit return path that communicates the reply from the called service.

Unlike SAGA, TXMS participants can be called hierarchically and can still participate in the transaction. Thus the protocol is a superset of SAGA.

An example check-out application

A web-based checkout application can illustrate the protocol. In this application, several steps must be completed:

Check and potentially reserve the inventory
Calculate price
Process payment
Trigger fulfillment
Send order confirmation email

Each step in this process can go wrong. Some operations, such as sending an email, cannot be undone if the operation needs to be rolled back.

TXMS provides space for the optimization of this process. Checking and potentially reserving the inventory can be the forward operation of a fulfillment service call. The complete operation can then trigger the fulfillment.

The order confirmation mail can be sent in the complete operation of the email service where it is clear that the order will be processed. Because the transaction manager repeats complete calls on failure, sending the order confirmation email must be made idempotent.
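The following Java sketch illustrates what "idempotent" could mean for the email service. It assumes a hypothetical MailParticipant that remembers the transaction IDs it has already confirmed, and a helper completeWithRetry that stands in for the retry behaviour of the transaction manager; neither name comes from TXMS. The in-memory set is a simplification: a real service would need to keep this record durably.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of an idempotent complete operation. All names are hypothetical. */
final class ConfirmationMailSketch {

    /** Mail participant that sends the confirmation in its complete operation. */
    static final class MailParticipant {
        private final Set<String> confirmed = ConcurrentHashMap.newKeySet();
        final List<String> outbox = new ArrayList<>(); // stands in for a real mail gateway
        private boolean loseNextAck = false;

        /** Test hook: the next complete call does its work but fails to acknowledge. */
        synchronized void simulateLostAck() {
            loseNextAck = true;
        }

        /** Sends the mail only the first time a transaction ID is seen. */
        synchronized void complete(String txId) {
            if (confirmed.add(txId)) {
                outbox.add("Order confirmation for " + txId);
            }
            if (loseNextAck) {
                loseNextAck = false;
                throw new IllegalStateException("acknowledgement lost");
            }
        }
    }

    /** Repeats the complete call until it is acknowledged; returns the attempts used. */
    static int completeWithRetry(MailParticipant mail, String txId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                mail.complete(txId);
                return attempt;
            } catch (RuntimeException e) {
                // No acknowledgement received: repeat the call.
            }
        }
        throw new IllegalStateException("complete not acknowledged for " + txId);
    }

    public static void main(String[] args) {
        MailParticipant mail = new MailParticipant();
        mail.simulateLostAck();
        int attempts = completeWithRetry(mail, "tx-42", 3);
        System.out.println("attempts=" + attempts + ", mails sent=" + mail.outbox.size());
        // prints "attempts=2, mails sent=1"
    }
}
```

The second call is acknowledged without sending a second mail, so the customer receives exactly one confirmation even though complete was called twice.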

The pricing service probably doesn't even need to be included in the transaction, as it is read-only and might decide not to register with the transaction manager.

The payment service can reserve the amount that is needed on the credit card. After the complete operation, it can asynchronously follow up with the payment processing or withdraw the reservation on compensate.
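As an illustration of how a participant might split its work into forward, complete, and compensate, here is a small Java sketch of a payment participant. The names PaymentParticipant, Reservation, and Status are assumptions, and the actual credit card processing is left out; the sketch only tracks the intermediary state per transaction ID.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a payment participant. All names are hypothetical. */
final class PaymentSketch {

    enum Status { RESERVED, CAPTURED, RELEASED }

    /** Immutable record of a reservation and its current status. */
    static final class Reservation {
        final long amountCents;
        final Status status;

        Reservation(long amountCents, Status status) {
            this.amountCents = amountCents;
            this.status = status;
        }
    }

    static final class PaymentParticipant {
        private final Map<String, Reservation> byTx = new ConcurrentHashMap<>();

        /** Forward operation: reserve the amount; a repeated call keeps the first reservation. */
        void reserve(String txId, long amountCents) {
            if (amountCents <= 0) {
                throw new IllegalArgumentException("amount must be positive");
            }
            byTx.putIfAbsent(txId, new Reservation(amountCents, Status.RESERVED));
        }

        /** Complete operation: the reservation may now be turned into a payment. */
        void complete(String txId) {
            finish(txId, Status.CAPTURED);
        }

        /** Compensate operation: withdraw the reservation. */
        void compensate(String txId) {
            finish(txId, Status.RELEASED);
        }

        /** Only a RESERVED entry changes; repeats and unknown IDs are ignored. */
        private void finish(String txId, Status target) {
            byTx.computeIfPresent(txId, (id, r) ->
                r.status == Status.RESERVED ? new Reservation(r.amountCents, target) : r);
        }

        /** Returns null when nothing was ever reserved for the transaction. */
        Status statusOf(String txId) {
            Reservation r = byTx.get(txId);
            return r == null ? null : r.status;
        }
    }

    public static void main(String[] args) {
        PaymentParticipant payment = new PaymentParticipant();
        payment.reserve("tx-7", 4999);
        payment.complete("tx-7");
        payment.compensate("tx-7"); // late or duplicate call: ignored
        System.out.println(payment.statusOf("tx-7")); // prints "CAPTURED"
    }
}
```

Ignoring a compensate call for an unknown transaction ID is a design choice of this sketch: it covers the case where the forward operation never reached the service before the transaction was rolled back.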

So, the called services need to handle the separation into a forward and complete or compensate call, where the explicit coding of complete opens up potential for optimization. On the other hand, a caller or orchestrator can still look conventional:

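// Simplified example: the @TXMS annotation obtains the transaction ID from the transaction manager.
@TXMS(id=Cart.id)
public int checkout(Cart pCart) {
	Order iOrder = order.create(pCart);
	fulfillment.process(iOrder);            // forward: check and reserve the inventory
	payment.transfer(iOrder);               // forward: reserve the amount
	mailService.sendConfirmation(iOrder);   // the mail itself is sent on complete
	return OK;
}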

In this simplified example, the @TXMS annotation can create a transaction ID in an idempotent way by using the transaction manager. In all called services, the transaction ID is injected into their downstream calls. If needed, those services can then register with the transaction manager to receive the result of the transaction. The @TXMS annotation can then detect any error situation and notify the transaction manager of a complete or compensate situation.
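How the annotation is processed is not specified here, so the following Java sketch shows the described behaviour as a plain helper method instead. The TxManager interface, the inTransaction helper, and the RecordingTxManager used for the demonstration are hypothetical names, and treating any runtime exception as an error situation is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch of what a @TXMS-style interceptor might do. All names are hypothetical. */
final class TxmsInterceptorSketch {

    /** Minimal stand-in for a client of the transaction manager. */
    interface TxManager {
        String begin(String businessKey); // idempotent: same key, same transaction ID
        void complete(String txId);
        void compensate(String txId);
    }

    /** Runs the business method inside a transaction and reports the outcome. */
    static <R> R inTransaction(TxManager tm, String businessKey, Function<String, R> body) {
        String txId = tm.begin(businessKey);
        R result;
        try {
            result = body.apply(txId); // the body passes txId on to its downstream calls
        } catch (RuntimeException e) {
            tm.compensate(txId);       // error situation: ask for compensation
            throw e;
        }
        tm.complete(txId);             // no error: ask for completion
        return result;
    }

    /** Fake manager that only records which operations were requested. */
    static final class RecordingTxManager implements TxManager {
        final List<String> calls = new ArrayList<>();

        @Override public String begin(String businessKey) {
            calls.add("begin");
            return "/transactions/" + businessKey;
        }

        @Override public void complete(String txId) {
            calls.add("complete");
        }

        @Override public void compensate(String txId) {
            calls.add("compensate");
        }
    }

    public static void main(String[] args) {
        RecordingTxManager tm = new RecordingTxManager();
        String order = inTransaction(tm, "cart-1", txId -> "order for " + txId);
        System.out.println(order + " " + tm.calls);
        // prints "order for /transactions/cart-1 [begin, complete]"
    }
}
```

Deriving the transaction ID from a business key such as the cart ID is one possible way to make the creation idempotent: a repeated checkout call for the same cart would map to the same transaction.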

The most frequently asked question is, "Isn't the transaction manager a single point of failure or a scalability problem?" The transaction manager scales as well as any other microservice. You can distribute transaction IDs across database shards to spread the workload. The transaction manager employs a data structure that is like an event log to store the information for a transaction. This event log must be persisted reliably; for example, you can run the event log with quorum writes on a distributed database. However, because of the idempotency and retry mechanisms that are defined in the protocol, only the actual outcome needs to be written reliably, which leaves room to optimize performance.
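The two ideas in this answer, spreading transaction IDs across shards and keeping an event log per transaction, can be sketched in a few lines of Java. The event names, the hash-based shardFor function, and the in-memory TxLog are assumptions for illustration; the sketch does not show real persistence or quorum writes, it only marks which events are outcomes.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of shard selection and a per-transaction event log. All names are hypothetical. */
final class TxEventLogSketch {

    enum Event { CREATED, PARTICIPANT_REGISTERED, COMPLETED, COMPENSATED }

    /** Picks a shard for a transaction ID; floorMod keeps the result non-negative. */
    static int shardFor(String txId, int shardCount) {
        return Math.floorMod(txId.hashCode(), shardCount);
    }

    /** Outcome events are the ones that would need the reliable (for example, quorum) write. */
    static boolean isOutcome(Event e) {
        return e == Event.COMPLETED || e == Event.COMPENSATED;
    }

    /** Append-only log for one transaction; the outcome is derived by reading it. */
    static final class TxLog {
        private final List<Event> events = new ArrayList<>();

        /** Appends an event; once an outcome is logged, later outcome events are ignored. */
        synchronized void append(Event e) {
            if (isOutcome(e) && outcome() != null) {
                return; // the first outcome wins, so retries cannot flip the result
            }
            events.add(e);
        }

        /** Returns the logged outcome, or null while the transaction is still ongoing. */
        synchronized Event outcome() {
            for (Event e : events) {
                if (isOutcome(e)) {
                    return e;
                }
            }
            return null;
        }

        synchronized int size() {
            return events.size();
        }
    }

    public static void main(String[] args) {
        TxLog log = new TxLog();
        log.append(Event.CREATED);
        log.append(Event.PARTICIPANT_REGISTERED);
        log.append(Event.COMPLETED);
        log.append(Event.COMPENSATED); // ignored: an outcome is already logged
        System.out.println(log.outcome() + " on shard " + shardFor("/transactions/42", 8));
    }
}
```

Because String.hashCode is defined by the Java language specification, the same transaction ID maps to the same shard on every instance of the service, which is what lets any instance locate the log for a given transaction.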

TXMS is a cloud-native transaction protocol that is reliable, performant, and polyglot, and that can help create consistency across microservices.