By Simon Woodhead
I’m afraid we have a confession and we wanted you to hear about it from us, whether or not you subsequently hear about it elsewhere, because we let ourselves down.
There are some lessons from it we can all learn, and significant improvements that can be made, as ultimately all that matters is that when someone calls 999, it’s because they need help, and we all need to ensure they get connected to it.
In June 2020 we experienced an outage on our BT circuits facing two BT sites from our Telehouse East location. This was described by BT on reporting as a “transmission error” as those circuits did not show as down but were not passing calls.
This resulted in us rejecting 21 calls to 999 with a 503 error. These may have been retried to another network by our customers of course, although judging by the CDRs it appears there were 6 callers making 21 attempts. Anyway. 21 calls represents, at most, just 0.0055% of the overall calls otherwise completed across the network at the time.
None of our direct to end user traffic was affected at all.
We have an obligation – as do our peers – to report incidents to our regulator, Ofcom, and we did so. That reporting process is still ongoing and it would be inappropriate to divulge any detail at this time. However, that does not mean we cannot share the learning, in the hope that we all do our best in getting help to those whose circumstances are so dire that they have to call the emergency services.
I will explain shortly what went wrong and what we learned from this, but one reason for writing this post now is that we can see there is a lot wrong with the way 999 works, as well as mistakes we made and learning we’re keen to share with our customers and the wider community.
We want to do that regardless of any action Ofcom may or may not decide to take and, frankly, hold ourselves to a higher standard than those we would freely criticise were the roles reversed. I should stress our lessons and remediation were near immediate (in fact, mostly within the 3 day window in which we were obliged to report) and, despite similar triggers subsequently, the incident has not recurred, and cannot recur.
If we were an airline this would be standard practice and welcomed in improving everyone’s safety, but it seems that is not generally accepted in telecoms. Peter Farmer wrote a blog on this, and on the fantastic book ‘Black Box Thinking’, a few years ago.
We have significant diversity in our routing to BT, which is the path for 999 calls as they operate the Emergency Call Handling Authority. Switches across three Simwood sites face BT switches in 5 sites and these are fully meshed such that in the event of an outage there is simply a drop in capacity and calls flow another way. That was evident on the day in question with 99.9945% of completable calls doing so. Had there been an issue there we would have known due to monitoring.
The issue here was that 999 was not following that normal call flow. This was due to a well-intentioned design treating 999 differently. In trying to eliminate points of failure, we’d specifically routed 999 in a way that failed quickly with a graceful SIP 503 rather than bouncing around the network and failing in a way which could have resulted in a timeout to the end-user. In putting 999 on a pedestal this way, we’d introduced a weakness which manifested here. Add to that the fact that we were monitoring for circuits which were down, rather than those which weren’t up, and we gave ourselves no visibility of this softer “transmission error” between BT and ourselves.
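To make the "down" versus "not up" distinction concrete, here is a minimal sketch in Python. It is an illustration only, not our actual monitoring: the CircuitStats fields, the min_attempts threshold and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CircuitStats:
    """Hypothetical per-circuit counters for one sample window."""
    name: str
    link_up: bool   # what the transmission layer reports
    attempts: int   # call attempts offered to the circuit
    answered: int   # attempts that actually completed


def is_down(c: CircuitStats) -> bool:
    """Alarm only when the circuit itself reports as down."""
    return not c.link_up


def is_not_up(c: CircuitStats, min_attempts: int = 5) -> bool:
    """Alarm if the circuit reports down, or if enough calls were
    attempted in the window and none of them completed."""
    if not c.link_up:
        return True
    return c.attempts >= min_attempts and c.answered == 0


# A circuit that shows as up but is not passing calls.
soft_failure = CircuitStats("bt-site-1", link_up=True, attempts=21, answered=0)
print(is_down(soft_failure), is_not_up(soft_failure))  # prints: False True
```

The first check stays silent for a circuit that shows as up while passing nothing, which is the shape of the "transmission error" described above. The second check catches it, at the cost of needing some traffic in the window before it can say anything.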
Of course, we remedied all of that very quickly afterwards. We took 999 off its pedestal and made it route without special logic, save that which is required on the TDM side. This way issues are far more observable. We improved the monitoring such that we’re now alerted to the softer errors on the BT interconnect, and we went a stage further still so that every single failed 999 attempt is now raised as an alert, even if the failure is the caller hanging up, or the call completed in a subsequent attempt. And we ensured massively more utilisation of the dense mesh of routes into BT as well as adding a fall-back route over a peer provider. So we’re confident that it cannot recur here but we’re not at all happy with where the market has evolved with 999 at a wider policy level. I will return to that!
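The "alert on every failed attempt" rule is simple enough to sketch. The following Python is a hedged illustration rather than our production code: the CallAttempt record, its field names and the alert text are hypothetical, and the only protocol facts relied on are that a 2xx SIP response means success and that 487 is what a caller hanging up before answer typically produces.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CallAttempt:
    """Hypothetical record of a single call attempt."""
    caller: str      # calling number
    dialled: str     # dialled digits
    sip_status: int  # final SIP response code for this attempt


def failed_999_attempts(attempts: Iterable[CallAttempt]) -> List[CallAttempt]:
    """Return every 999 attempt that did not end in a 2xx response.

    Each attempt is judged on its own: a later successful retry by the
    same caller does not suppress the alert for the earlier failure.
    """
    return [a for a in attempts
            if a.dialled == "999" and not 200 <= a.sip_status < 300]


def alert_lines(attempts: Iterable[CallAttempt]) -> List[str]:
    """Format one alert line per failed 999 attempt."""
    return [f"ALERT 999 attempt from {a.caller} failed with SIP {a.sip_status}"
            for a in failed_999_attempts(attempts)]


sample = [
    CallAttempt("caller-a", "999", 503),  # rejected
    CallAttempt("caller-a", "999", 200),  # retry succeeded, first failure still alerts
    CallAttempt("caller-b", "999", 487),  # caller hung up before answer
]
for line in alert_lines(sample):
    print(line)
```

Deliberately noisy is the point here: an alert for a hang-up costs a glance, whereas a silent 503 costs far more.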
Meanwhile, we await our fate from Ofcom but want to highlight our understanding of the legal position for our customers. The legal obligation is to ensure “uninterrupted access to Emergency Organisations as part of any Publicly Available Telephone Services offered”.
Publicly Available Telephone Services, or PATS, broadly means making available to the public a service that can make or receive telephone calls based on numbers in numbering plans. That definition pretty much applies to all of our wholesale customers – and, just to deal with some old myths still out there, there are no real exemptions any more and there haven’t been for a while.
You, or your PATS providing customer, have a strict obligation and therefore we’d counsel you to not treat 999 as an exception on a pedestal but to route it with the same redundancy you would other call traffic, i.e. send it to multiple sites on the Simwood network but also consider failing over to another operator; or fail over to Simwood if you have calls primarily going elsewhere.
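As a rough sketch of what that looks like on the customer side, the Python below tries an ordered list of routes and moves on whenever one fails. Everything named here is an assumption for illustration: the route labels are made up, and send stands in for whatever your own SIP stack does to place a call and report the final status code. Treating any non-2xx response or timeout as a reason to try the next route is also a choice made for the example, not a statement of how any particular network behaves.

```python
from typing import Callable, Optional, Sequence


def route_with_failover(
    number: str,
    routes: Sequence[str],
    send: Callable[[str, str], Optional[int]],
) -> Optional[str]:
    """Try each route in order and return the first that answers with a 2xx.

    send(route, number) is a hypothetical hook returning the final SIP
    status code for the attempt, or None if the attempt timed out.
    Returns None if every route failed.
    """
    for route in routes:
        status = send(route, number)
        if status is not None and 200 <= status < 300:
            return route
    return None


# Hypothetical route order: two sites on one network, then another operator.
ROUTES = ["primary-site-a", "primary-site-b", "other-operator"]


def fake_send(route: str, number: str) -> Optional[int]:
    """Stand-in for a real SIP stack: the primary network is rejecting calls."""
    return 200 if route == "other-operator" else 503


print(route_with_failover("999", ROUTES, fake_send))  # prints: other-operator
```

The useful exercise is less the code than filling in the route list honestly: if it only has one entry, the answer to ‘In this failure state, will a 999 call still connect?’ is already known.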
We find customers tend to send calls the same way they updated the data, which whilst it keeps BT (and possibly other operators) happy, does not get help where it needs to be with as much certainty as it could. We’re not suggesting that you broadcast your 999 calls any which way but loose and treat this blog as carte blanche to avoid investigations into CUPID mismatches, but ask yourself the question ‘In this failure state, will a 999 call still connect?’.
Some networks may whinge and whine about network numbers etc. when you’re in a failover state, but we are pretty damn sure that you’ll get whinged and whined at a lot more by a regulator with infinite cosmic powers to make your life unbearable if you gave ‘number unavailable’ to someone in distress.
GCA3.2(b) does require us and you to maintain a functioning network so single VMs or single interconnects in single data centres are to be avoided; we’re confident in our architecture generally and 0.0055% of calls affected here doesn’t tick the catastrophic failure box in our view. We’d again question how many reportable outages experienced by our peers actually were reported – there have been no public investigations at least.
This pedestal thing is a concern and it’s a lesson we learned the hard way.
Test calls to 999 cannot be routinely made yet it requires distinct routing. It therefore gets pseudo-special treatment. I say ‘pseudo’ because whilst it is treated differently to other traffic, I don’t know that is a benefit – one can be blissfully unaware of issues that will affect it. On this occasion ordinary calls were fine, and 999 wouldn’t have received a 503 error code had it been treated like a call to Domino’s. If ordinary calls were failing, every network in the land would notice, but how many know and monitor the status of every 999 call attempted?
By way of example, for reasons some of you know but I tend to keep private, we call 999 a lot in our house. At one stage we called ambulances, often requiring the air ambulance, 3 times a week. 5 years on we still do so, though thankfully a lot less regularly. I’d therefore suggest that I’m a bit of an expert on calling 999 and can hand on heart (or in an Affidavit!) say that most of the dozens of calls attempted over my partner’s mobile have failed.
Ironically we usually end up calling over Simwood. I expect it is a WiFi calling quirk rather than them failing within the network concerned and so there’ll probably be no log of them, and certainly no report to Ofcom by them – or even a need for them to. Similarly, as a wholesale operator, we are aware that some CPs do not even appear to offer fully compliant 999 despite the legal requirement on them to do so; this begs the question – how many of their customers have sought help and how many reports have resulted from that? As an industry we don’t know what we don’t know.
Back when we used to charge for 999 service establishment as a wholesale service, we were concerned that literally none of our customers opted in. Over the years we’ve made it cheaper and now free to encourage adoption, but still some are not opted in. So on the one hand we offer it free and see less-than universal take-up whilst others still charge for it. So how many services around the country actually attempt to offer 999 in the first place? Surely the answer to that question is quite significant.
Once established, operators need to update a ghastly database within BT. While we enable that through our API, it is still an aberration of a process behind the scenes, especially where ported numbers are involved. And woe betide anyone who passes a 999 call presenting a BT number that isn’t ported to them – a CUPID mismatch. The officious emails we receive from them on a daily basis would suggest strongly that this shouldn’t be allowed and the person in need unwittingly doing so should not only have had their 999 call terminated but should be strung-up. This is wrong and we should be in a place of ‘connect first, worry later’ but that message is not landing in certain quarters – although we are grateful to have a team that are very adept at telling BT precisely what to do with this correspondence.
Lastly, in this IP interconnected world, with real-time location, why on earth are we uploading text files over FTPS (yes, SFTP would be too easy) to be batch processed by a mainframe showing where the caller might be some of the time? It is arcane. All calls from Simwood are marked as nomadic because we offer services to VoIP providers, so the CHA know not to trust that address entirely, but imagine if we could include some of the rich intelligence we have in SIP headers.
Imagine if apps could interrogate the GPS and not only pass through location coordinates but also altitude and direction of travel. When I was in Mountain Rescue I developed something like this whereby we could SMS a missing person and we’d then get a live update of all this from their phone; an app could embed that functionality subject of course to strict privacy controls. Imagine if rather than text files with daft abbreviations (and them being rejected from the database if you get the abbreviation wrong – better no address to send the ambulance to than the wrong spelling of ‘Limited’ apparently), an open API could enable rich location data to be populated as frequently as it changes? Moreover though, why are we relying on a daisy chain of providers to get a call in the hands of the CHA in the first place?
One look at how E911 in the USA works should put us all to shame in this regard. The USA has some seriously sexy capabilities.
The answer to that is perhaps money? BT operates the CHA and is keen to remain doing so based on recent consultation submissions; it recovers part of the cost of doing so from the calling operators and so it trickles down the chain. Imagine instead if the CHA was operated by the NHS and accessible by any operator through a choice of medium? In the mobile world devices will connect to any available network; why not have this in the SIP world? OTT apps could send 999 to a central HA 999 proxy, falling back to one provided by their operator, the default becoming direct delivery of calls with no point of failure in between. Paying for that solution should follow the choice of solution, not dictate it.
I’m heading off-piste here, into territory we’ll address in respect of Ofcom’s recent consultation on video relay for 999, but the point is there is lots to be done in 999 to both know the extent of the problem and indeed to solve it. Some of this will require the regulatory environment to catch up with the evolution of the UK’s telecommunications infrastructure, some of it is just as simple as our customers taking a breath and asking ‘If I get a 503 from Simwood on a 999 call, what do I do?’
So in closing, we screwed up here and have genuinely beaten ourselves up more than anyone else could. We have righted our failings but wanted to share this experience in the hope of making the path smoother for others and opening a dialogue on doing things better. We can all do things better!