Network outages are going to happen. The marketing department talks about Zero Outages -- and that's a great goal to have. But as the pragmatic engineering and operations team, you can prepare for outages so that you prevent some of them and remediate the rest quickly.
- Skip the Pretend Redundancy
- Detect & Triage Faults
- PCAP: Packet Capture
- Ready for Remote Testing
- Logging Enabled & Synchronized
- Prepare the Humans
Skip the Pretend Redundancy
Everybody knows you need hot spares. Pretend-Redundancy is what you get when you spend twice as much on equipment and software but never test it. Salesmen love it because they sell everything you need to be reliable.
Many companies buy spare equipment and even plug it in, but actually implementing working redundancy is far more challenging. Proper physical design can be elusive, and testing time-consuming. But it's worth it to have a robust network. Outage-Prepared Networks not only build redundancy, they test their redundancy.
Redundancy in networks means that the system:
- Has adequate replicas (e.g., spare servers, spare routers)
- Continuously synchronizes all necessary information between replicas
- Automatically detects faults
- Automatically takes the faulted component offline (so that traffic doesn't keep going to the faulted component)
- Automatically selects a replacement component replica
- Causes the replacement component to become active or start to take the additional workload
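The detect, isolate, select, and activate steps above can be sketched as a small simulation. This is only an illustration: the `Replica` and `RedundantGroup` names, the `probe` callback, and the rule of promoting the first healthy, synchronized standby are all assumptions, not a description of any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Replica:
    name: str
    healthy: bool = True   # result of the most recent health probe
    in_sync: bool = True   # has this replica received the latest state?


@dataclass
class RedundantGroup:
    replicas: List[Replica]
    active: Optional[Replica] = None
    offline: List[Replica] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: the first replica starts as the active one.
        if self.active is None and self.replicas:
            self.active = self.replicas[0]

    def run_health_checks(self, probe: Callable[[Replica], bool]) -> None:
        """Detect faults by probing every replica that is still in service."""
        for replica in self.replicas:
            if replica not in self.offline:
                replica.healthy = probe(replica)

    def fail_over(self) -> Optional[Replica]:
        """Take a faulted active replica offline and promote a healthy, synced standby."""
        if self.active is not None and self.active.healthy:
            return self.active  # nothing to do
        if self.active is not None:
            self.offline.append(self.active)  # stop sending traffic to it
        candidates = [r for r in self.replicas
                      if r not in self.offline and r.healthy and r.in_sync]
        self.active = candidates[0] if candidates else None
        return self.active
```

Running a loop like this against a test copy of the network on a schedule is one way to find out whether the redundancy is real or pretend: if `fail_over()` returns `None`, there was no usable spare.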
Detect & Triage Faults
To prepare for outages, you've got to be equipped to know when the faults are occurring.
- Get the basics: Get the status and health information your devices have to offer using SNMP. Is the interface up? Get alerts when components go offline. You can do this fault detection and reporting with a variety of systems, including OpenNMS, SolarWinds, and others.
- Measure the normals: Beyond the basic good/bad readings, you can define Key Performance Indicators (KPIs) like number of active users, concurrent calls, CPU and memory usage, and set thresholds for what is normal or abnormal. CPU of 50% may be normal, but 90% will be normal for some workloads and abnormal for others.
- Get the alerts to a human. Generating alerts isn't enough: some responsible human has to get the alert and do the right thing! The human analysts also need an escalation path to alert the problem-solvers.
- Analyze. When you get an alert, you need to troubleshoot it to determine its severity. Fault detection is imperfect, so you need to determine whether the alert is real, and triage the situation for seriousness. Few automated alerts are reliable enough to be trusted without verification.
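As a rough sketch of "measure the normals" and "get the alerts to a human", the snippet below classifies a CPU reading against per-workload thresholds and then picks a recipient. The workload names, the threshold values, and the escalation targets are hypothetical placeholders; real values have to come from observing your own network.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Hypothetical per-workload CPU thresholds, in percent.
CPU_THRESHOLDS: Dict[str, Threshold] = {
    "web-frontend": Threshold(warning=60.0, critical=80.0),
    "batch-transcoder": Threshold(warning=95.0, critical=99.0),
}


def classify(workload: str, cpu_percent: float) -> Optional[str]:
    """Return 'critical', 'warning', or None when the reading is normal."""
    limits = CPU_THRESHOLDS[workload]
    if cpu_percent >= limits.critical:
        return "critical"
    if cpu_percent >= limits.warning:
        return "warning"
    return None


def route_alert(severity: Optional[str]) -> str:
    """Pick who gets the alert; the targets stand in for a real escalation path."""
    if severity == "critical":
        return "page on-call engineer"
    if severity == "warning":
        return "notify monitoring analyst"
    return "no alert"
```

With these example thresholds, 90% CPU is critical for `web-frontend` but normal for `batch-transcoder`, which is the point of defining normal per workload rather than globally.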
In the June 2018 Visa Europe outage, the problem was bad hardware. The Network Operations team needed to contact people who could travel to the equipment site. The network monitoring analysts need a reliable communication path to other teams.
PCAP: Packet Capture
Outage-Ready Networks have packet capture capabilities online throughout the network -- at least at the low-speed disaggregated points. Capturing 100 Gbps Ethernet may be impractical today. But capturing the traffic going in and out of each 1 Gbps load balancer should be achievable.
Capturing traffic should not require a dispatch to a physical site: the system should be built so that troubleshooters can activate packet capture, then collect and analyze the data, within minutes.
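One way to make capture something a troubleshooter can start in minutes is to script the command ahead of time. The sketch below builds a `tcpdump` ring-buffer command line and wraps it for SSH. It assumes `tcpdump` is installed on a capture host reachable over SSH; the host name, interface, file path, and filter in the usage note are made-up examples.

```python
import shlex
from typing import List


def build_capture_command(interface: str, output_path: str, bpf_filter: str,
                          file_size_mb: int = 100, file_count: int = 10) -> List[str]:
    """Build a tcpdump argv for a ring-buffer capture.

    -n avoids name lookups, -C rotates the output file after roughly
    file_size_mb million bytes, and -W caps the number of files so the
    oldest capture is overwritten instead of filling the disk.
    """
    return [
        "tcpdump", "-n",
        "-i", interface,
        "-w", output_path,
        "-C", str(file_size_mb),
        "-W", str(file_count),
        bpf_filter,
    ]


def as_ssh_command(host: str, argv: List[str]) -> List[str]:
    """Wrap the capture so it can be started on a remote capture host over SSH."""
    return ["ssh", host, shlex.join(argv)]
```

For example, `as_ssh_command("probe1.example.net", build_capture_command("eth0", "/var/tmp/lb1.pcap", "host 192.0.2.10 and port 5060"))` yields an argument list that can be handed to `subprocess.run`. Capturing usually needs elevated privileges on the remote host, which this sketch does not handle.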
Ready for Remote Testing
The operations staff need the ability to run tests remotely, so they can replicate problems as users experience them.
- Voice (Phone) services -- It's easy to REGISTER a SIP / IMS phone over the Internet, or to build VPNs, to replicate the experience of users attaching at different points in the network
- Web services -- It's easy to use VPN or DNS to route traffic to a particular entry point in the network
- Local ISPs -- If you have a global service that uses advanced BGP or DNS-based routing, you can set up service with local ISPs in your markets and place test units there, so you can test the experience of local users in those markets.
After the outage has begun, it's usually too late to build Remote Testing capability. Outage-Prepared organizations build these capabilities in advance.
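For web services, a simple form of this capability is a script that sends a request to one chosen entry point while still presenting the public host name. The sketch below does that for plain HTTP using Python's standard library; the function name and parameters are illustrative assumptions, and an HTTPS version would also need to handle TLS server names and certificate checks.

```python
import http.client
from typing import Tuple


def fetch_via_entry_point(entry_ip: str, port: int, hostname: str,
                          path: str = "/", timeout: float = 5.0) -> Tuple[int, bytes]:
    """Send a GET for `hostname` to one specific entry point, bypassing DNS."""
    conn = http.client.HTTPConnection(entry_ip, port, timeout=timeout)
    try:
        # Supplying our own Host header keeps the request looking like normal
        # user traffic even though we connected to a hand-picked address.
        conn.request("GET", path, headers={"Host": hostname})
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Calling it once per entry point and comparing the status codes is a quick way to see whether a problem is global or confined to one site.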
Logging Enabled & Synchronized
All of our systems have logging. In telecom, though, many of those logs are disabled by default. To prepare for outages, enable the logging you can, and be sure the logs are useful.
- Enable the logging. Turn on the logging you can afford to enable without crippling the device. Debug logging is often too much.
- Synchronize the clocks. To find out what truly happened in a problem, you often need to know the sequence of events that led up to it. But all too often the clocks on different systems are not synchronized (for example, with NTP), which makes it too hard to reconstruct what happened quickly enough to remediate an outage.
- Centralize the logging. Whenever possible, aggregate all your logs to a central location, bearing in mind that an outage can take out the logging location as well. Centralized logging with an analyzer tool like Splunk can radically reduce remediation time, but if your centralized-logging sites are down or unreachable, you still need to troubleshoot the problems. I prefer systems that store some logs locally, and send a copy to the centralized log storage.
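The "store locally and send a copy" approach can be sketched with Python's standard `logging` module: one handler writes to a local file and a second forwards the same records to a syslog collector over UDP, both stamped in UTC. The function name, the log format, and the collector address are assumptions for illustration; network devices would do the equivalent in their own configuration.

```python
import logging
import logging.handlers
import time


def build_logger(name: str, local_path: str,
                 central_host: str, central_port: int = 514) -> logging.Logger:
    """Log every record to a local file and send a copy to a syslog collector."""
    formatter = logging.Formatter(
        "%(asctime)sZ %(name)s %(levelname)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    formatter.converter = time.gmtime  # UTC timestamps so hosts can be compared

    local = logging.FileHandler(local_path)
    local.setFormatter(formatter)

    # UDP syslog is fire-and-forget, so records still reach the local file
    # when the collector is down.
    central = logging.handlers.SysLogHandler(address=(central_host, central_port))
    central.setFormatter(formatter)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(local)
    logger.addHandler(central)
    return logger
```

UTC timestamps only help if the clocks underneath them agree, so this complements clock synchronization rather than replacing it.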
Prepare the Humans
To prepare for an outage, you need all of your staff ready to help. Too many organizations depend on their most senior engineers for outage troubleshooting, but this is a big mistake. You need your full staff prepared to triage and analyze hard problems.
Training should focus on how to determine whether each component in your network is working. Components can be servers (like www2, or the DNS server 184.108.40.206) or services (like the SIP SBC at 220.127.116.11 or the REST API at https://foo.com/api/v2/).
The 24x7 troubleshooting staff should have:
- List of supported products. (Supported means that if it breaks, we have to fix it.)
- Diagrams of how the product works, for purposes of understanding dependencies.
- List of components involved in each product
- Method for testing each component, e.g.:
  - REGISTER with SIP to sip:email@example.com and make a call to sip:+firstname.lastname@example.org;user=phone
  - curl https://foo.com/api/v2/get-some-good-stuff | grep good-stuff-v2-
- Expected outcomes: How to interpret the test results
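The per-component tests and their expected outcomes can be kept as a small runnable checklist instead of a document that nobody opens. The sketch below is one possible shape: `http_body_contains` is a rough Python equivalent of piping `curl` into `grep`, and `ComponentTest` pairs each check with a plain-language description of a pass. All names here are illustrative assumptions.

```python
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict, List


def http_body_contains(url: str, marker: str, timeout: float = 5.0) -> bool:
    """Rough equivalent of `curl URL | grep MARKER`: True when the marker is in the body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return marker in response.read().decode("utf-8", errors="replace")
    except OSError:
        return False  # URLError, HTTPError and socket timeouts are all OSError subclasses


@dataclass
class ComponentTest:
    component: str
    check: Callable[[], bool]
    expected: str  # plain-language description of a passing result


def run_tests(tests: List[ComponentTest]) -> Dict[str, bool]:
    """Run every check and report pass/fail per component."""
    return {t.component: t.check() for t in tests}
```

A REST API entry might then be written as `ComponentTest("rest-api", lambda: http_body_contains("https://foo.com/api/v2/get-some-good-stuff", "good-stuff-v2-"), "body contains good-stuff-v2-")`. A SIP REGISTER check would need a SIP client or library, which is outside this sketch.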
With a training and preparation regime, the staff monitoring the network will be able to isolate the faulty component.
Don't rely exclusively on your senior technical staff to support you in an outage. Train the whole staff in testing.