Use Your Database to Power State Machines Lawrence Jones
Use your database to power state machines | Lawrence Jones #
Excerpt #
If you build a state machine on top of a relational database you can abstract concurrency problems away from your business logic and allow developers to write safe-by-default code without dealing with concurrency concerns. This post explains how to build a library that offers those protections, and how they work under-the-hood.
Most people are familiar with state machines and know their value. The average state machine library can help you model states, prevent invalid transitions, and produce diagrams that help even non-technical people understand how the code behaves.
This article isnât about making the case for state machines. Itâs about how you take the concept of a state machine and have it work alongside your database models, leveraging your relational database (say Postgres or MySQL) to help you build concurrent-safe and efficient software.
I first encountered this pattern when I joined GoCardless in 2015. Processing bank payments is a multi-day affair and extremely stateful, so itâs no surprise that the team eventually built a library called Statesman that provided a state machine powered by underlying transition tables.
Most of GoCardlessâ critical processes use Statesman, and by the time I left we had transition tables with well over 10B rows. It became such an essential tool that Iâd argue it was a powerful competitive advantage. Statesman is even reflected externally, such as in the GoCardless public API with endpoints providing rich audit trails.
So if you use Ruby, grab Statesman and get going. But for those who want these benefits but are non-Rubyâers, you can implement a small library in your language of choice in just a few hours provided you understand the nuances of transition tables, locking, and edge cases.
Thatâs what I did the other day at incident.io. Hereâs a guide so you can do it too.
Transition tables #
Letâs begin by saying that in most applications, you want to be capturing every transition that a resource has made through a state machine and storing it for later analysis. In some situations, you may even consider the history of transitions to decide which state to transition to next.
Thatâs why the first component of any database state machine will be creating a table to contain transitions.
-- Assuming the payments table is already here.
create table payments (/* ... */);
create table payment_transitions (
id text primary key default generate_ulid() not null,
payment_id text not null references payments(id),
to_state text not null,
most_recent boolean not null,
sort_key integer not null,
created_at timestamptz not null default now(),
updated_at timestamptz not null default now()
);
Each transition row states:
- The parent resource it belongs to (
payment_id
) - The state this transition is moving into (
to_state
) - Whether this transition is the latest (
most_recent
) - An ordinal to allow for logical ordering of transitions (
sort_key
)
Now we add two index-backed unique constraints that are going to ensure our state machineâs integrity.
create unique index idx_payment_transitions_by_parent_most_recent
on payment_transitions
using btree(payment_id, most_recent)
where most_recent;
create unique index idx_payment_transitions_by_parent_sort_key
on payment_transitions
using btree(payment_id, sort_key);
The first ensures we can only ever have a single transition that is most_recent
for any payment, a clear requirement if we ever want to sensibly ask âwhat state is this payment in?â. The second ensures we get no duplicates on sort_key
: less important, but useful to ensure all transitions can be strictly ordered.
Expressing the state machine in code #
Now we have a table weâll use to store transitions, we need to build a library that can be used to express the state machine in our codebase.
In the Go library weâre using at incident.io, we start by expressing the transition table as a domain model that our ORM can work with:
type PaymentTransition struct {
// ID is the unique ID of this transition
ID string `json:"id" gorm:"type:text;primaryKey;default:generate_ulid()"`
// PaymentID is a reference to the parent resource
PaymentID string `json:"payment_id"`
// ToState is where this transition was to
ToState PaymentState `json:"to_state"`
// MostRecent is true when this transition is the most recent
MostRecent bool `json:"most_recent"`
// SortKey provides ordinality over transitions
SortKey int `json:"sort_key"`
// CreatedAt is set upon transition creation
CreatedAt time.Time `json:"created_at"`
// UpdatedAt is set whenever this transition is modified
UpdatedAt time.Time `json:"updated_at"`
}
func (PaymentTransition) Parent() Payment {
return Payment{}
}
func (a PaymentTransition) State() PaymentState {
return a.ToState
}
// ParentColumn tells our machine library what column refers
// to the parent resource.
func (PaymentTransition) ParentColumn() (structField, column string) {
return "PaymentID", "payment_id"
}
Youâll notice PaymentState
is a new type, which we define as a Go enum value:
type PaymentState string
const (
PaymentStatePendingSubmission PaymentState = "pending_submission"
PaymentStateSubmitted PaymentState = "submitted"
PaymentStatePaid PaymentState = "paid"
PaymentStateCancelled PaymentState = "cancelled"
)
Finally, we have the parent resource (Payment
) implement a Machine()
method that ties all this together and tells our library about valid transition paths:
func (p *Payment) Machine() machine.Machine[PaymentTransition, Payment, PaymentState] {
return machine.New[PaymentTransition, Payment, PaymentState](
p, machine.Configure(
machine.From(PaymentStatePendingSubmission).To(
PaymentStateSubmitted,
),
machine.From(PaymentStateSubmitted).To(
PaymentStatePaid, PaymentStateCancelled,
),
),
)
}
machine.New
builds a config struct that records each of the states and where they are permitted to transition into. In our example, that means pending_submission
is allowed to transition to submitted
, but not to paid
or cancelled
.
package machine
type Config[State comparable] struct {
states map[State][]State
}
Executing a transition #
Our code knows everything it needs to model transitions now, and we can transition a payment like so:
transition, err := payment.Machine().
TransitionTo(domain.PaymentStatePendingSubmission).
Execute(ctx, db)
Really simple, but what does this give us? Why is using our database for this better than simply doing things in-memory?
Well, even when there are concurrent processes transitioning the same payment, the transition will only be created if the source -> destination is permitted in our configuration. Thatâs extremely useful for most modern applications which have many processes working at once, but want their developers to write code that acts as if itâs a single-process system.
For Execute()
to provide this guarantee you need to know how your database is configured and be careful about how you take locks. But assuming youâre running Postgres in read committed
mode (the default) then you can copy this code and know youâre safe.
In SQL, as thatâs what Execute()
is really running, the function looks like so:
begin;
-- (1)
update payment_transitions
set most_recent = 'f'
where most_recent
and payment_id = 'PM123'
returning *;
-- (2)
/* use previous result to validate transition from -> to */
-- (3)
insert into payment_transitions (
payment_id, sort_key, most_recent, to_state
)
values (
'PM123', 20 /* previous sort_key + 10 */, 't', 'submitted'
);
commit;
The first update (1) will lock the previously most recent transition, and block any competing transactions (this means they halt until the lock is released).
The first to win is returned the previous transition and can check that moving from the previous state to the next desired state is permitted, and will rollback the transaction if it is not. In the Go implementation I shared above, that means checking the Config
struct to make sure this is permitted.
Finally, if the transition is valid, weâll insert a new transition row for the destination state with most_recent
set and sort_key
advanced from the previous transition row, relying on the unique indexes to catch any conflicts.
Thatâs how a successful transaction that was the first to attempt a transition will run, and any subsequent transactions will follow one of two paths:
(a) Initial transition fails #
If the first transaction fails then we rollback, release the lock, the competing transaction acquires it and proceeds as if the first transaction never occurred.
This reduces to âstart againâ.
(b) Initial transition succeeded #
If the first transaction suceeded, the update will unlock but return zero rows from a read committed Postgres.
This looks to us like either (i) there was no transition and we should assume this is will be the initial entry for this parent or (ii) weâve been beaten by a competing transaction.
We proceed by pretending like itâs the initial transition as in the case of:
- (i) Weâd be correct, and we create the initial transition successfully.
- (ii) Weâre about to try creating another
most_recent
transition and our insert will fail against themost_recent
unique index.
Ergonomic APIs #
Developers who use the machine APIs benefit from the concurrency protections without needing to think too much about them.
When we hit conflicts, the library returns ErrTransitionConflict
and we have a helper that retries if this error is returned. That means you can âsolveâ concurrent processes racing against each other by just wrapping your code in a retry, which would otherwise require someone to think carefully about their concurrency model and bring that complexity into their normal application code.
With all your transitions in the database, it becomes efficient and easy to find all X in state Y by leveraging database indexes. The API and resulting SQL looks like this:
/*
select
*
from payments
join
payment_transitions
on payments.id = payment_transitions.payment_id
and most_recent = 't';
*/
payments, err := domain.NewQuerier[domain.Payment](db).
Joins(machine.InState[domain.PaymentTransition, domain.Payment](db,
domain.PaymentStatePendingSubmission,
)).
Find(ctx)
Weâve also allowed TransitionTo
to receive arbitrary partials that set auxiliary fields on transition tables, so that you can track associated metadata against transitions that need them.
An example would be to attach a submission ID whenever you move a payment to submitted, where that will set a submission_id
column on the resulting transition row:
transition, err := payment.Machine().
TransitionTo(domain.PaymentStatePendingSubmission).
With(
domain.PaymentTransitionBuilder.SubmissionID(payload.SubmissionID),
)
Execute(ctx, db)
Just try it #
Iâve use banking examples in this post, but donât think this is limited to payments. We use this at incident.io to:
- Power incident updates, where each update has a message, severity, status etc and
incident_updates
is the transition table. - Track the status of alert sources so we can record when something was setup, activated, or deactivated and why that happened.
- Our nudges feature creates ânudge runsâ that are transitioned as we work them.
Provided you follow the instructions Iâve given, youâll inherit the concurrency protections and if you try it out, I promise youâll soon realise their value. Even if itâs when your Data team notice how easy it is to build their models, because thereâs much richer data available and arranged consistently than you have before.
Low cost, high upside. Give it a shot.
If you liked this post and want to see more, follow me at @lawrjones.