Followers of our status page will know that there were some issues with our interconnects to/from BT recently. Those who don’t follow it hopefully didn’t notice! This post serves as a post-mortem as well as providing more information on how we have architected this side of things.
Whilst a full chronology of events is on our status page, in short, BT had an issue in Ealing (West London) on Sunday. This took out many of our circuits that connect from our Slough site to BT exchanges around London. There was no impact from this on our services because of the unique way we’re architected. BT fixed most of the issues by Sunday afternoon, leaving a few very lightly used circuits dark, again with no impact on our service. There were some lessons for us here but generally we were pleased with the outcome.
Coming into Monday morning, we realised we were artificially blocking some international traffic destined for BT over SS7. We have a number of exchanges into which we can put this traffic, but we need to manage the capacity carefully to leave room both for incoming traffic and for outgoing traffic that economically needs to favour that exchange. This is completely automated but is based on an assumed level of capacity. We’d reduced that assumption manually on Sunday for the affected exchanges and, due to human error, it was not re-adjusted afterwards. Sorry to the few customers whose calls were affected and, of course, it was reset the moment we noticed.
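As a rough sketch of the kind of automated head-room management described here (all names, numbers and the reserved fraction are hypothetical, not our actual implementation), the failure mode was essentially a stale input to a check like this:

```python
# Hypothetical sketch: each exchange has an assumed channel capacity, part
# of which is reserved as head-room for incoming and economically-favoured
# outgoing traffic. Names and figures are invented for illustration.
ASSUMED_CAPACITY = {"exchange_a": 2000, "exchange_b": 1500}  # channels
RESERVED_FRACTION = 0.30  # head-room kept free for favoured traffic

def admit_international_call(exchange, channels_in_use):
    """Admit discretionary international traffic only while the exchange
    stays below its assumed capacity minus the reserved head-room."""
    usable = ASSUMED_CAPACITY[exchange] * (1 - RESERVED_FRACTION)
    return channels_in_use < usable
```

If `ASSUMED_CAPACITY` is reduced by hand during an incident and not restored once the incident is over, the check keeps blocking traffic even though the real capacity has returned, which is the human error described above.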
Throughout Monday, the other circuits remained down but our service was normal. We were working with BT behind the scenes, and they asked if they could replace some equipment in our racks (see below for why BT have equipment in our racks), which would take all of our circuits from Slough to BT out of service. This was kind of a big deal, especially given it had to happen during peak times. We are architected for that scenario, but it was to be a fairly extreme test. It thankfully went without incident and our architecture described below was (again) proven. It did not, however, restore service to the remaining circuits.
We’re pleased to report that the remaining circuits were brought back into service late on Wednesday (yesterday), but we have yet to receive a formal reason for outage from BT. The only notes they apparently have on the system simply say “service restored”. Hey ho.
Why did it go smoothly?
This incident had the potential to be a 3-4 day outage, but it wasn’t, because of the way we do things. The architecture proved itself, and the way we do things differently to the norm was validated. It had been proven internally before with less significant outages on the BT network, but we have never had to shut 50% of the interconnects down during peak hours before. There are inevitably learning points, but they are mostly around internal process and expectations of BT. The cautious approach that underpinned our architecture nevertheless demands more redundancy still, and internally we’re asking ‘what if?’ starting from the position of 50% of the network being shut down. It shouldn’t be forgotten, though, that that is the position many of our competitors start from.
What do we do differently?
There are two fundamental differences between the way we interconnect and how we believe our competitors do. First, though, let’s dispense with a common misconception:
We do not use BT IPX! BT IPX is a BT-managed service providing IP to/from BT which, in turn, connects to the PSTN. We provide IP to and from the PSTN, which includes a regulated, mutually owned interconnect to BT but, crucially, includes other network operators too. We’re happy to have a conversation about how we compete with BT IPX, but to be clear: we do not compete with BT IPX resellers, our customers do. They do it very well! Sadly there is a subtlety of words and understanding there that is often abused and even seems to be believed: a recent letter claims we use the “old” way of connecting to BT, and that IPX is the “new” way and as high as one can go in the food chain. That is wrong, plain wrong. BT IPX is a product of BT, whereas an SS7 interconnect is a regulatory requirement on BT. There is no regulated IP interconnect to BT at present, however some may choose to present it that way, both to themselves and externally.
So, back to what we do differently to actual competitors who have an actual regulated interconnect with BT and how that helped us here.
Firstly, there are two ways of physically interconnecting. The one we use is where BT manage connectivity all the way to site, and the demarcation point is actually in our racks. On this occasion, that meant that wherever the issue was (unless it was our switch, or a cross-connect in our rack), responsibility for fixing it was theirs. Putting aside their many failings, once an issue gets into the hands of the engineering and SS7 network management team, they’re very good.
The other way would be to connect in-exchange and, for anyone other than the biggest, that means using 3rd-party connections to and in the exchange. Thus, the connection becomes indirect, with a third party transiting traffic in between. This is much, much cheaper, both in terms of the virtual circuits to BT (as they’re just piggy-backing on other physical capacity) and the cost of transit to get there. So if we wanted to send our customers emails gloating about how many points of interconnect we had, or if our business were Premium Rate (so we needed to get closer to the source of traffic on the BT network), this would be the way to go. However, it also means there’s another link in the chain, another entity to convince of an outage, and another set of processes to navigate to get something done about it. If we worked this way we’d find ourselves having to blame our “network provider”, but we could tell you about lots and lots of points of interconnect! We provide service to serious ITSPs who in turn serve SMEs and Enterprise, so a third party having the potential to affect that service just isn’t acceptable to us. We have unashamedly few (relatively speaking) points of interconnect, but they’re authentic and they’re engineered for maximum availability!
Now, to an IP network engineer, or anyone who has grown up in an IP world, the way voice calls are routed across the BT network will seem bizarre. Forget any notions of BGP and route failover, or even DNS. Each number range hosted on the network has a route defined, which we can manage through form-based requests to BT. It dictates which of the BT exchanges we connect to a given call will be routed to. There is both a primary and a failover exchange for each area of the country, such that calls to, for example, our 03 range will go to exchange A first and exchange B second if they come from Scotland, and vice versa if they come from Wales. The common way to architect things, therefore, would be to have switches connecting to the exchanges closest to them. You might think that, in this scenario, if the switch connected to exchange A failed, BT would send traffic to the switch connected to exchange B. That, unfortunately, is not the case!
In this example, should BT lose exchange A, they would overflow traffic to exchange B. The same would happen if they had congestion on their network. However, in the event of us losing the link, be it through a JCB pulling up a fibre or a power outage in the data-centre, we understand they would not. In that circumstance, the traffic is already at exchange A and has nowhere to go. Calls routed that way would thus fail! This is nonetheless the typical architecture, given that in the event of calls failing one can telephone BT and have them manually alter each route. Of course, the more switches one has, the smaller the geographic area affected, but the fact remains there would be an outage and it would have to be manually rectified. That isn’t acceptable to us.
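The routing behaviour above can be sketched as follows. The exchange names and dictionary shapes are invented for illustration; BT’s real routing plans are managed via forms, not an API:

```python
# Illustrative model of BT's number-range routing described above.
# Each origin region lists a primary and a failover exchange; names invented.
ROUTE_PLAN = {
    "scotland": ["exchange_a", "exchange_b"],  # primary, failover
    "wales":    ["exchange_b", "exchange_a"],
}

def route_call(origin, exchange_up, link_up):
    """BT overflows to the failover only when the *exchange* itself is down
    (or congested). If the exchange is up but *our link* out of it is down,
    the call has already arrived at that exchange and has nowhere to go."""
    for exchange in ROUTE_PLAN[origin]:
        if exchange_up[exchange]:
            # Exchange reachable: BT delivers the call here regardless of
            # whether our link from it is alive.
            return exchange if link_up[exchange] else None  # None = call fails
    return None  # no exchange reachable at all
```

Note that a dead link with a healthy exchange strands the call: that is the manual-failover gap in the conventional design.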
Instead, we reach every point of interconnect from (presently) two switches: one in London, one in Slough. Our routing plan for BT instructs them to send traffic 50/50 between the two. Where this differs from the normal approach above is that, should we lose a switch, a site, or a link, there is no disruption, at least to new calls. This weekend’s issue was off our network, but it could easily have been on it, and had we been architected the normal way the result would have been the same: 50% of calls would have failed. Given our approach, BT simply see a drop in capacity on the affected exchanges, but we’re still connected there and thus calls still complete. On Sunday, rather than 50/50, 100% of calls were flowing into London from all the same exchanges they normally come from, without us lifting a finger.
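The effect of losing a site in this dual-homed design can be sketched as below. The site names are as described in the post; the channel figures are purely illustrative:

```python
# Illustrative model of the dual-homed design: every BT exchange is reached
# from both of our switch sites, with traffic split 50/50 between them.
SITES = ("london", "slough")

def exchange_capacity(per_site_channels, sites_up):
    """Channels BT sees available at an exchange: the sum over live sites.
    Losing a site halves the capacity, but the route itself stays alive."""
    return per_site_channels * sum(1 for site in SITES if sites_up[site])

def call_completes(sites_up):
    """A new call completes so long as at least one site is reachable; BT
    sees reduced capacity rather than a dead route, so no manual re-route
    is needed."""
    return any(sites_up.values())
```

This is the contrast with the conventional design: a site loss degrades capacity rather than killing routes.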
Of course, this is twice as expensive, even assuming the exchanges we connect to are equidistant between sites, as we need to maintain enough head-room to accommodate all traffic in a single site. In practice, any saving from connecting to an exchange closer to one of our switch sites is offset by it being more expensive to connect from the other. It’d be much cheaper to connect only to the closest exchanges, but to our mind that isn’t a compromise worth making.
It would be very easy to rest on our laurels or even gloat, but nothing is ever perfect. There are lessons to learn, and they are mostly human, but it serves as a reminder that the redundancy we seek to achieve is imperative, not optional. We’re growing at an unprecedented rate, and so are some of our customers. Whilst it is very easy to provision channels on the IP side of the network as requested, we need to be mindful that capacity on the PSTN side is finite, takes a long, long time to put in place, and is very capital-intensive. In the not-too-distant past we’ve had the opportunity to triple the size of the business in a single day due to others not giving customers the capacity they’d paid for, and that, of course, is hugely tempting with 50%+ head-room sitting there. For this reason we need to be quite disciplined in saying “no” to requests for capacity that isn’t truly needed. We also have an internal project to ensure that, in the event of capacity being reduced, the customers who get capacity first are those who are paying for it and have committed to us as their primary carrier. We see traffic balloon when others have issues, and it is wrong for loyal customers to be anywhere other than in front of those who merely keep capacity with us for that eventuality.
We’re also asking the question: is two enough? Should we have three connections to every exchange? If we do, do we keep 50% head-room, or can we reduce it to 33% and still tolerate one site failure? Whilst we connect to every exchange from pretty much opposite sides, it is not inconceivable that an incident could affect both sites, and possibly even all three. Should we therefore consider the more normal manual failover, but to far-distant exchanges, e.g. Manchester, not instead of the way we work but as well? These are all questions we continue to ask ourselves, so hopefully you don’t have to.
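The head-room question reduces to simple arithmetic: to survive the loss of f sites out of n, each site must keep a fraction f/n of its capacity free. A minimal sketch:

```python
def required_headroom(sites, failures_tolerated=1):
    """Fraction of each site's capacity that must stay free so the surviving
    sites can absorb all traffic after `failures_tolerated` site losses.
    With n sites of capacity C, total traffic must not exceed (n - f) * C,
    so normal utilisation is at most (n - f) / n and head-room is f / n."""
    return failures_tolerated / sites

# Two sites need 50% free; three sites need only ~33% for the same
# one-site-failure tolerance.
```

Tolerating the loss of two sites out of three, by contrast, puts the requirement back to ~67%, which is why the number of sites and the failure scenarios assumed have to be chosen together.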
Finally, we said right at the start that BT is not the PSTN, they are part of it, and therefore our strategy has also been to interconnect with other operators wherever possible. If traffic originating off-BT can come to us directly, that is the best kind of redundancy we can build into the BT interconnects, at least as far as that traffic is concerned. There is a whole raft of questions that come into play there, such as how and where that provider has its own switch sites, as we could very easily end up with full redundancy on our side connecting to a single point of failure on theirs. We are actively building other interconnects, and it is one of many areas that differentiates us from IPX: IPX is IP to/from BT and then the PSTN; we’re IP to/from the PSTN, including BT.
There are good reasons we do all this, as hopefully you’ll appreciate. Naturally, we’re always here if you have any questions.