Reactive Systems: How to Fail Nicely
In the previous post in this series, we talked about the performance and scalability benefits that we get from actors and the actor system. I also briefly mentioned fault tolerance and promised to cover it in another blog post. Not being one to shy away from a promise I’ve made (except all those times where I’ve shied away from a promise I’ve made), let’s talk about fault tolerance in Akka!
The supervisor strategy
An actor system consists of a hierarchy or tree of actors. That is, each actor is itself capable of creating zero or more new child actors which it then supervises. Being a supervisor means that you have the option of being notified of lifecycle events of your child actor(s): when they [re]start, when they stop, and – most important for this post – when they encounter an error.
The parent actor is able to define what to do when a child actor throws an error. This is called a supervisor strategy, and lets us react differently to different errors that may occur during a child actor’s execution. The supervisor strategy gives us a nice way to be fault tolerant: we can gracefully handle different classes of errors in whatever way we’d like, and we can also decide what to do with the child actor that has just encountered an error: resume, restart or terminate.
Fault Tolerance in IPF
In IPF we have the concept of an endpoint – our representation of an internal bank system with which we must communicate in order to process an instant payment. Good examples of endpoints that IPF would talk to include: accounting systems, fraud systems, AML, sanctions, reporting,… I could go on.
Nobody’s perfect, and from time to time these bank systems may encounter a failure or – worse – a period of unavailability. In order to gracefully handle such issues and avoid cascading failures, we use some Akka features such as supervisor strategies, become/unbecome, and stashing at the endpoint level in order to protect us from constantly retrying a failing endpoint, and also to preserve message order for when the endpoint is available again.
IPF keeps track of the number of times the endpoint has failed to respond to us, or returned an unexpected error. It’s a circuit breaker of sorts in that we have a threshold of the number of times the endpoint failed to respond, at which point we mark the endpoint as unavailable. When an endpoint is unavailable, we won’t try to contact it again until a later point in time when we know that it’s available once again
Recovery
While an endpoint is unavailable, we stash messages that we received during the downtime. When the endpoint becomes available again, we replay all stashed messages in the same order. We call this the “recovery” phase, after which the endpoint becomes fully available again.
There will also be the situation where we’re still receiving requests while we’re recovering. After all this is an always-on, zero downtime payments engine and we would theoretically constantly be sending and receiving payment messages. So how do we deal with messages coming in while we’re trying to recover, and avoid being infinitely stuck in recovery hell?
It’s unfortunately not that exciting but makes a lot of sense. We enter a “hold” state where we temporarily let live messages queue up while we drain the stashed messages we received while unavailable. When that’s done, we start consuming from the live queue, and we’re back in business!
Conclusion
Akka provides us with a super handy box of goodies for handling errors when they inevitably happen. The goal is to push error handling as far down the hierarchy as possible where we can preempt errors and know exactly what to do when they occur. And in the reactive world that “what to do” might very well just be to turn it off and on again.
At first it might be a little counterintuitive to just make an actor restart when it encounters an error — restarting a child actor is in fact the default Akka supervisor strategy — but it turns out that “have you tried turning it off and on again?” really is the best way to handle most errors. The reality is that if a child actor has failed, it’s probably done so because it’s gotten itself into some inconsistent state and, as a result, a restart might help.
IPF’s circuit breaker implementation ensures that no messages are lost when an endpoint is unavailable and that messages are replayed in the same order when the system is back. We also have a way to escape from an Infinite Loop of Doom (trademark pending) where there’s just one more message that we need to process, and remain in a “hold” state forever.
Finally, you may have been screaming at your screen for the past few minutes you’ve spent reading this article, trying to tell me that Akka itself has its own circuit breaker implementation. I’ve only recently come across this and so we’ll have to look at adopting that instead of our homemade one to make our product even better.
This blog is part of a series, read the previous post here, and the next one here.