Our Architecture and the AutoPilot Pattern


By Simon Woodhead

Back at SimCon1, I presented the first overview of our architecture and how our interconnects, network, and platform operate to give maximum performance and resilience to issues. I say 'first' because how we do things, and more importantly why we do things the way we do, is a big topic. There's a video of that talk, and subsequent presentations will dive into the software side of things. However, I promised an intro to a key philosophy of ours, and that is the use of the AutoPilot pattern.

Orchestration

I explained how, for anyone starting out in containerisation, there is a mind-blowing number of decisions to make and complex architectures to build just to get to the 'start', and one of those decisions relates to orchestration. Anyone installed Kubernetes, for example? Of course, those decisions can be made for you if you use the public cloud, but then you're locked into somebody else's decisions. Most importantly, whichever way you go, you're highly unlikely to have that environment on your laptop, breaking one of the very promises of containerisation: if it runs on your laptop, it'll run in production.

At Simwood, after many months of R&D, we decided to go a different way and keep life simple. We're not managing hundreds of thousands of hosts worldwide, so we don't need to be encumbered by the tools to do so. However, we do manage (large!) hosts numbering in the high tens, and containers and legacy/deprecated Virtual Machines running into the thousands, worldwide across multiple datacentres. To conclude that we're too small for orchestration would therefore be wrong; we're just not big enough for the problems of big orchestration to be worth the effort. We've tripled traffic on the network in the last few months alone, and brought other new datacentres online, so we also know a bit about the challenges of growth and have a need to scale. We just don't need orchestration enough to justify its costs, notably to the DevOps agility where we've made huge wins!

AutoPilot

So how do we manage all this stuff? We rely on a technique called AutoPilot, which essentially means a container is self-describing and self-managing, and thus runs the same anywhere.
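
To make that concrete, here's a minimal sketch of the lifecycle in Python. It is entirely illustrative – none of these names are our actual tooling – but it shows the shape: the container supervises its own application, rather than deferring to an external orchestrator.

```python
import subprocess
import time

# Minimal sketch of the AutoPilot lifecycle (illustrative, not Simwood's code):
# the container wraps its own application and handles pre-start work,
# health-checking, and service advertisement itself.

def pre_start():
    """Whatever the app needs before it runs: fetch data, discover peers, etc."""

def is_healthy() -> bool:
    """Return True only when the application is genuinely ready for service."""
    return True  # e.g. probe the application's port or status command

def advertise():
    """Make the node discoverable, e.g. register it or announce its IP."""

def main():
    pre_start()
    app = subprocess.Popen(["redis-server"])   # the real workload
    advertised = False
    while app.poll() is None:                  # supervise for the container's lifetime
        if not advertised and is_healthy():
            advertise()
            advertised = True
        time.sleep(5)

if __name__ == "__main__":
    main()
```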

I’ve already described how we replaced the Docker network stack in order to make containers first-class citizens on the network, enable anycast, and put our Devs in control of both network requirements and firewalling. It is all described in the container config, which lives in a repository. We’re good – deploy that container anywhere on the network and the magic happens automatically.
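
As a purely hypothetical illustration of what 'self-describing' means here (our real schema looks nothing like this), imagine a declaration along these lines shipping in the container's repository:

```python
# Hypothetical self-description shipped with the container -- the shape of the
# idea, not our actual config schema. Addresses use documentation ranges.
CONTAINER = {
    "name": "redis-read",
    "network": {
        "address": "per-datacentre",        # resolved at deploy time, varies by site
        "anycast": ["192.0.2.53"],          # shared read address (example value)
    },
    "firewall": [
        {"allow": "tcp/6379", "from": "internal"},   # Redis clients only
        {"deny": "all", "from": "external"},
    ],
}
```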

But what about application variables? Take a database, for example: do we define the master in code? Hell no, although in our legacy stack, where the master was well-defined and had been up for 5+ years, that could have been a consideration. Do we push the master address to it on deployment? Well, we could, but that is error-prone, and what happens when things change, i.e. the master fails? And what about data, given that a new database node will (generally) have a huge dataset to import before it can start? Ideally we do not want multiple slaves pulling huge datasets from the master, especially when we're recovering from an issue or adding scale, given both imply the master is busy enough already.

For some databases there is native clustering – Elasticsearch and Galera being two examples that spring to mind, both of which we use. They’re relatively easy, but AutoPilot still has a place. For others, such as Redis, AutoPilot saves our bacon. Let me try and explain how.

So we have a single image for Redis company-wide, regardless of location or role. On deployment, the network and firewall are taken care of as I described above, including elements such as the address, which can vary between datacentres. So that gets us a running Redis instance with no data and no master or slave role. How do we make it useful?

Well, where our instance is the master, it dumps its database every few minutes and we back that up off-site. We do that for slaves too, just for belt-and-braces. This file is many gigabytes in size and, more significantly, the changes to the data exceed the size of the dataset in under 10 minutes – there are many, many thousands of writes per second, and the data is changing fast. So our new node knows it needs data and is configured to download and import the latest dataset from the repo. Now we have a node with recent data.
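
In sketch form, that bootstrap step looks something like this – the URL and paths are hypothetical, but the logic is the point: a fresh node pulls from the backup repo, not from the busy master.

```python
import os
import urllib.request

# Hypothetical bootstrap (illustrative names throughout): fetch the most recent
# dump before Redis starts, so a new node never drags the full dataset from a
# master that is already busy enough.
BACKUP_URL = "https://backups.example.net/redis/latest.rdb"  # not a real URL
DATA_DIR = "/data"

def fetch_latest_dump():
    rdb_path = os.path.join(DATA_DIR, "dump.rdb")
    if not os.path.exists(rdb_path):             # only on a fresh, empty node
        urllib.request.urlretrieve(BACKUP_URL, rdb_path)
    # Redis loads dump.rdb from its data dir on startup, giving the node
    # recent data before it ever talks to the master.
```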

AutoPilot also enables us to discover the master. We could do this by storing the master IP in a key-value store, or in DNS, but those have a number of challenges and dependencies. Our approach is actually a combination of these: a Redis instance will assume itself to be the master, but AutoPilot will discover who our master currently is, with an opportunity to override it so we can safely run the container in the lab, or on a dev laptop, without it joining the production cluster.
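
Sketched in Python with redis-py (the hostname, service name, and environment variable are all illustrative), that decision looks roughly like this; the override is what lets the same image run on a laptop without joining production:

```python
import os
from typing import Optional
from redis.sentinel import Sentinel  # redis-py

def discover_master() -> Optional[str]:
    """Return 'host:port' of the current master, or None to assume the role ourselves."""
    override = os.environ.get("REDIS_MASTER_OVERRIDE")   # lab/laptop escape hatch
    if override:
        return None if override == "self" else override
    try:
        # Ask Sentinel who the master currently is (names are illustrative).
        sentinel = Sentinel([("sentinel.internal", 26379)], socket_timeout=1)
        host, port = sentinel.discover_master("mymaster")
        return f"{host}:{port}"
    except Exception:
        return None   # nobody to follow: assume we are the master
```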

So we now have a running slave and, whilst it has got there autonomously, the process is blocking – loading tens of gigabytes into RAM is blocking, even for Redis. Given our massive use of anycast, every read node shares an anycast IP address, which means that as soon as a node announces that address it will receive some of the read hits. We don’t want that while it is still blocking, so AutoPilot delays announcing the IP address to the network until the node is actually ready for service.
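
The readiness gate can be sketched like so; announce_anycast_ip() is a hypothetical stand-in for whatever drives the routing layer (for us, the network stack described earlier):

```python
import time
import redis  # redis-py

def wait_until_ready(host: str = "127.0.0.1", port: int = 6379) -> None:
    """Block until this replica has loaded its dataset and is in sync with the master."""
    r = redis.Redis(host=host, port=port)
    while True:
        try:
            info = r.info()
            # 'loading' covers the RDB import; 'master_link_status' covers replication.
            if info.get("loading", 0) == 0 and info.get("master_link_status") == "up":
                return
        except redis.exceptions.ConnectionError:
            pass  # Redis may refuse connections early in startup
        time.sleep(2)

def announce_anycast_ip() -> None:
    """Hypothetical hook: tell the routing layer to start announcing our anycast IP."""

if __name__ == "__main__":
    wait_until_ready()
    announce_anycast_ip()   # only now does this node start receiving read traffic
```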

Bingo, we can now deploy a new node, have it get current data, join the cluster, and bring itself into service anywhere in the world autonomously thanks to AutoPilot.

But what about after deployment? Well, each service is different and, in the particular case of Redis, we use Sentinel, which discovers and monitors running slaves and elects a new master should the current one fail. We’re not so lucky with other services, especially where applications inter-relate – a perfect example being Freeswitch containers behind Kamailio load-balancers. AutoPilot saves us again: new Freeswitch nodes announce their availability for service and get added to Kamailio via an onChange process; similarly, Freeswitch nodes removed from service are removed from Kamailio. This is all automatic!
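
A sketch of what such an onChange handler can look like: the dispatcher file format and the kamcmd reload command are standard Kamailio, but the path is illustrative, the addresses are examples, and the discovery plumbing that calls the handler is elided.

```python
import subprocess

DISPATCHER_FILE = "/etc/kamailio/dispatcher.list"   # illustrative path

def on_change(healthy_nodes):
    """Rewrite Kamailio's dispatcher list from the current healthy Freeswitch set."""
    lines = [f"1 sip:{addr}:5060" for addr in healthy_nodes]   # set 1: Freeswitch pool
    with open(DISPATCHER_FILE, "w") as f:
        f.write("\n".join(lines) + "\n")
    # Ask Kamailio to re-read the dispatcher list without a restart.
    subprocess.run(["kamcmd", "dispatcher.reload"], check=True)

# Example: called by the discovery layer whenever membership changes, e.g.
#   on_change(["10.0.0.11", "10.0.0.12"])
```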

Between all of that, AutoPilot breaks the back of the problem orchestration is trying to solve, but in a way that works well for us! Running a global voice operator, with regulatory requirements for uptime, is a very different challenge to most DevOps projects. Similarly, I don’t know of any other voice operator that would or could embrace a DevOps approach – virtualisation is the new kid for many, and even then it is often used to run the magic appliance that replaces the same vendor’s magic box. So Simwood is unique, and unique challenges need unique solutions.

In the footsteps of giants

I’ve portrayed AutoPilot as a ‘thing’ above but really the ‘thing’ is our implementation of it and, given how unique that is, I’ve not dived into the ‘how’ at all. It is a philosophy that you may implement in your own way, to the extent that feels comfortable, but it is really important to say it isn’t a Simwood-original philosophy – we borrowed it.

When we started on our journey into containerisation we started with Triton from Joyent. It didn’t work out for us, largely around the networking side of things and, frankly, because we don’t know what we’re doing with Solaris. That doesn’t diminish my Cantrillian man-crush however; Bryan Cantrill (Joyent CTO) and team have done some awesome things. You may have heard of Node.js, a project Joyent sponsors, but they’re also responsible for AutoPilot and maintain a great resource at http://autopilotpattern.io along with ContainerPilot – a helper script that does much of this for you. ContainerPilot didn’t work out for us, but it is a perfect place to start and will be fine for less unique requirements.

Battle-tested

Following the migration of voice services to our new stack, we’re almost 100% on the containerised architecture described above. Master databases, including Redis, were an exception, but that changed last weekend. Our ‘legacy’ architecture (which is way more modern than some others anyway!) is largely deprecated, save for a few final legacy services. We’re so pleased with the move and the flexibility and workflow benefits it gives us. The team have done a great job.

However, after a quiet weekend in which we migrated our master Redis, we faced some serious issues on Tuesday. Our master volumes had ballooned to a huge extent due to an undetected behaviour change between our old legacy master and the new containerised replacement, and the slaves were lagging. They were lagging only because the master was stuck in a loop, repeatedly doing full dumps to satisfy the needs of legacy slaves (in place for the handful of legacy services), which supported neither partial replication from the backup nor the newer file version.

Considering our new stack hosts have 128GB of RAM minimum (some repurposed hosts have 300GB+), the massive database inflation wasn’t in itself an issue; the master running out of RAM would have been, and that is what would have happened on the legacy system. However, if we hadn’t changed the master in the first place, the problem wouldn’t have occurred.

The larger point here is that the new architecture and AutoPilot made what would previously have been catastrophic essentially invisible to customers, and a lot easier for us to resolve! It was far from perfect though, so please don’t take this as an “aren’t we great” post. Some things didn’t work as they should have done, and some customers saw some call rejections some of the time while nodes were syncing up, as was noted on our status page. We learned lots about what more we need to do, but it gave us helpful feedback that we’re on the right track. It is ironic that only a few hours before the issue became apparent, we’d posted notification that we’d be testing the destruction (and automatic recovery) of our Redis master in a few weeks’ time!

R&D

We spend a huge amount on R&D, intent on making our service as highly available, feature-rich and performant as it can be. After all, one consequence of wholesale is that each of our customers has lots and lots of satisfied or angry customers when we get it right or wrong. That responsibility isn’t wasted on us, and it is why R&D in aggregate is our largest cost in the business. I worked out the other day that we probably spend more on it as a percentage of revenue than anyone else in the industry and, more alarmingly, more in absolute terms than one of our more ‘radioactive’ contemporaries. They’re many times our size, and those magic boxes I think they buy in the name of innovation cost a lot of money. So that’s a pretty amazing fact!

***

I hope the above was useful and look forward to hearing about your own adventures. Meanwhile, I’ll be speaking about this at forthcoming industry events if you’d like to know more.