By Simon Woodhead
July was not a month we’re proud of for reliability, and I feel you deserve some explanation. Let’s be clear though: I’m not using the word ‘outage’ here because we haven’t had one of those this decade. Even ‘service interruption’ seems strong, as the majority of customers were unaffected — although some of the few who were affected were affected on more than one occasion. Whatever the right word is, we feel we let affected customers down and want to give some clarity.
Of course, by doing so publicly we’re making a wider audience aware of issues they may not have experienced. We’re also empowering “me too’s” army of shiny-suited sales droids to show their own lack of reliability in a better light. I don’t think there’s any comparison to be had there and rely on them being unable to comprehend the facts that follow, whilst the clued-up reader forms an intelligent opinion.
We had a few incidents around the middle of the month. Generally, these related to channel limits being enforced incorrectly or, equally badly, not enforced at all. As channel limits are a key component of our unique fraud protections (see our talk at Asterisk World for more), and wrongly rejected calls are disruptive, this shouldn’t happen. Accounts that were affected were affected in every Availability Zone: the state of an account is one of the few things that is shared network-wide and is consistent across all Availability Zones. This is important for us to offer consistency of behaviour wherever on the network customers route calls, rather than tying one-customer to one-magic-box as is the conventional approach.
It took us an awfully long time to understand these issues. We initially believed we were not handling call events quickly enough, causing a backlog, and we made a number of changes to improve throughput — to little sustained avail. This seemed particularly odd given how dramatically we’d improved event handling performance late last year, but it was the only explanation at the time.
We use Redis for all of our line-of-business database work, notably authenticating and routing calls. A call never touches a traditional relational database until it is finished and being billed. This enables us not only to perform 200+ fraud checks per call but also to route a call in about 10ms — many times quicker than others who are not doing any fraud checks. However, this is entirely reading from Redis, whereas call events are writes to it.
We use different nodes for reading and writing respectively: we read from anycasted slaves all around the network, and we write to a single unicast master, as Redis uses a single-master replication model. Thus, whilst all slaves were performing fine and routing calls as quickly as ever, call events were being delayed or sometimes lost, rendering the channel-count data inaccurate. This was the root of the problem, as those values were being checked when routing calls.
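To make the failure mode concrete, here is a minimal sketch — not Simwood’s actual code — of how channel limits of this kind are typically enforced. In production the counts would live in Redis (e.g. incremented on call start and decremented by the hangup event, replicated to every Availability Zone); a plain in-memory dict stands in here. The class and account names are hypothetical.

```python
# Illustrative sketch of channel-limit enforcement. In production this
# state would be kept in Redis and replicated network-wide; a dict
# stands in for it here.

class ChannelGuard:
    def __init__(self, limits):
        self.limits = limits   # account -> maximum concurrent channels
        self.active = {}       # account -> current channel count

    def call_start(self, account):
        """Reject the call if the account is at its channel limit."""
        count = self.active.get(account, 0)
        if count >= self.limits.get(account, 0):
            return False       # over limit: reject the call
        self.active[account] = count + 1
        return True

    def call_end(self, account):
        """Hangup event handler: free a channel slot. If these write
        events are delayed or lost, counts drift and calls are wrongly
        rejected (or wrongly allowed)."""
        if self.active.get(account, 0) > 0:
            self.active[account] -= 1

guard = ChannelGuard({"acct1": 2})
assert guard.call_start("acct1")      # first call: allowed
assert guard.call_start("acct1")      # second call: allowed
assert not guard.call_start("acct1")  # third call: limit reached
guard.call_end("acct1")
assert guard.call_start("acct1")      # slot freed: allowed again
```

The key point is that the read path (checking the count when routing) stays fast even while the write path (applying events) is backlogged — which is exactly why the counts drifted without routing itself slowing down.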
Thankfully, we already had some ‘chaos monkey’ work scheduled for a day or so later. The plan was to go through the exercise of deleting our master/write Redis node and verify the work we’d done earlier in the year. All being well, a new master would be elected from our pool of slaves, and services would migrate to it seamlessly. This was a very scary thing to do, and frankly something I’d challenge all our competitors to demonstrate, because, as one of my team put it: “This is going to happen for real, better it happens in a planned way at the quietest time”. He’s so right, and thankfully the team’s amazing work paid off. There were some minor follow-up actions, but fundamentally service was unaffected and it was a resounding success.

We’ll be doing a lot more of this type of thing. It is hugely reassuring to have a database that self-healed a few weeks ago, rather than one shrouded in cotton wool that has run uninterrupted for years — both will inevitably die, but we know one will recover rather than hoping it might. Uptime is great but, as time elapses, uncertainty over recovery grows; testing and proving failure recovery is essential.
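The exercise above can be sketched as a toy simulation — this is not Simwood’s tooling, and the node names are hypothetical; it just shows the invariant the chaos test verifies: kill the master, and a replica must be promoted so writes keep flowing.

```python
# Toy simulation of the chaos exercise: delete the master and confirm a
# replica is promoted in its place. Real election (e.g. via Redis
# Sentinel) is stood in for by promoting the first available replica.

class Cluster:
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)

    def kill_master(self):
        """Simulate deleting the master node."""
        if not self.replicas:
            raise RuntimeError("no replica available for promotion")
        self.master = self.replicas.pop(0)   # election stand-in

    def write_target(self):
        """Where writes should currently be sent."""
        return self.master

cluster = Cluster("redis-m1", ["redis-s1", "redis-s2"])
cluster.kill_master()
assert cluster.write_target() == "redis-s1"  # a replica took over
```

The value of running this deliberately, at the quietest time, is that the assertion is proven before the failure happens for real.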
One unexpected outcome was that all our write performance issues evaporated, and Redis was again screaming along despite handling hundreds of thousands of events per second. Deleting the problematic master, with a fully functional slave elected in its place, had overcome the root issue of latency. It transpired that in our move to containerisation we’d gone from a host operating system which by default disables Transparent Huge Pages (THP) to one which by default always uses them. Redis hates THP and, in short, the latency of queries had increased slowly over the months this master node had been operating, to the point where we couldn’t write quickly enough during peak hours. The problem hasn’t gone away, as the replacement host is similarly configured, but it took 6 or 7 months to build up last time and we know what we need to do — we’ll be scheduling another maintenance window to delete the master again, this time replacing it with a newly redeployed slave on an appropriately configured OS.
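For anyone wanting to check their own hosts, a minimal sketch follows. It assumes the standard Linux sysfs path for the THP setting (verify against your distribution); Redis’s own documentation recommends THP be set to `never` because THP-related stalls inflate write and fork latency over time.

```python
# Check the kernel's Transparent Huge Pages setting, which Redis
# recommends be 'never'. The sysfs file contains something like
# "always madvise [never]" with the active mode in brackets.

THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"

def thp_mode(contents):
    """Parse the active THP mode out of the sysfs file contents."""
    start = contents.index("[") + 1
    return contents[start:contents.index("]")]

def check_thp(path=THP_PATH):
    try:
        with open(path) as f:
            mode = thp_mode(f.read())
    except OSError:
        return None  # not Linux, or the path differs on this system
    if mode != "never":
        print(f"WARNING: THP is '{mode}'; Redis recommends 'never'")
    return mode

assert thp_mode("always madvise [never]") == "never"
assert thp_mode("[always] madvise never") == "always"
```

On an affected host, disabling THP (e.g. via the distribution’s boot configuration) and restarting Redis avoids the slow latency creep described above.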
There are many lessons here for us, and our contrition is undiminished, but I hope the fuller explanation has given you the confidence in our team and architecture that it has given me.
If you were unaware of any issues, please see our status page, where you can subscribe to notifications. If, instead, you’d like to know more about our architecture and philosophy, the following videos may be of interest: