Q: What service-level agreement (SLA) does your networking client expect?
Everyone would like a network that was up 100% of the time, but no one can really afford that. Getting that last 1% is incredibly expensive. A network that's up 99% of the time is actually a pretty easy thing to achieve -- that's roughly 15 minutes of downtime each day, or a small blip of an outage every hour and a half. So it's important to set expectations that no network is perfect, but there are certain things that you can do to improve uptime.
The difference between something that's up 99.95% of the time and something that's up 99.96% of the time is difficult to design around. We don't have that kind of granularity in network engineering.
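To make the arithmetic behind those percentages concrete, here is a minimal sketch that converts an availability target into a downtime budget. The function name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration, not something from the interview.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # assumption: a non-leap year


def allowed_downtime_minutes(availability_percent: float, period_minutes: float) -> float:
    """Return the minutes of downtime permitted in a period at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.95, 99.96):
        per_day = allowed_downtime_minutes(target, MINUTES_PER_DAY)
        per_year = allowed_downtime_minutes(target, MINUTES_PER_YEAR)
        print(f"{target}% uptime: {per_day:.1f} min/day, {per_year / 60:.2f} hours/year")
```

At 99% the daily budget works out to 14.4 minutes, which is the "15 minutes" figure above. The gap between 99.95% and 99.96% is less than an hour of downtime spread over a whole year, which gives a sense of why that difference is hard to design around.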
Let me break [service-level agreements] down into a couple of different buckets. The first bucket is one-day return to service. This is a problem that might take a day to fix. Network equipment vendors often offer a service contract that lets them replace a part within four hours. That doesn't mean that the outage will last four hours, because it often takes a couple of hours to diagnose a problem, four hours for the part to arrive and a couple of hours to install it. But for some networks, that's sufficient for what they need. It's not the solution that most people want, especially if the company has maybe more than 50 people or more than three or four locations.
The improvement would be the next bucket, which is called N+1 redundancy. This means that any one component can fail and the system keeps working. So to achieve that, you have to have redundancy in the network. For example, a router might need two fans to keep it cool, but you would buy a model that has three fans -- any one can fail and the system can keep running. That's what's called N+1 redundancy -- the N is what's required for the system to keep running, and the 1 is for redundancy.
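The N+1 idea can be written down as a simple check. This is only a sketch: the function names and the fan counts are illustrative assumptions based on the example above, and it treats components as interchangeable, which real hardware may not be.

```python
def survives_failures(installed: int, required: int, failed: int) -> bool:
    """True if enough components remain working after `failed` of them fail."""
    return installed - failed >= required


def redundancy_label(installed: int, required: int) -> str:
    """Describe the redundancy level as N, N+1, N+2, and so on."""
    spare = installed - required
    if spare < 0:
        return "under-provisioned"
    if spare == 0:
        return "N (no redundancy)"
    return f"N+{spare}"


if __name__ == "__main__":
    # The router from the example: two fans are required, three are installed.
    print(redundancy_label(installed=3, required=2))
    print(survives_failures(installed=3, required=2, failed=1))
    print(survives_failures(installed=3, required=2, failed=2))
```

With three fans installed and two required, the router keeps running after one fan failure but not after two, which is exactly what N+1 promises and no more.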
Most equipment nowadays, especially networking equipment, is designed with N+1 redundancy on all the internal parts, and that can really improve the service-level agreement, because now you're in a situation where a part that fails does not automatically equal an outage. The exception is when the failure hits one of the few parts that isn't redundant. For example, all the fans and CPUs in the routers could be redundant, but if there's only a single network connection between two buildings and that link goes down, then you're going to have an outage.
So the third bucket is system-wide N+1 redundancy. That's where we have redundancy not just on the internal parts of the equipment, but for all the network links also. For example, you'd have dual network connections to a wiring closet or between offices. Especially if you're going between offices, it's important that the two connections are diversely routed, so that one backhoe doesn't ruin your whole day.
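One way to check a design for system-wide N+1 on the links is to remove each link in turn and see whether every site can still reach every other site. The sketch below does that with a brute-force search. The site names and the topology are hypothetical, and the approach is an illustration rather than a method described in the interview.

```python
from collections import defaultdict, deque


def is_connected(nodes, links):
    """True if every node can reach every other node over the given links."""
    if not nodes:
        return True
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == nodes


def single_points_of_failure(links):
    """Return the links whose individual failure would split the network."""
    nodes = {end for link in links for end in link}
    critical = []
    for index, link in enumerate(links):
        # Remove exactly one link; a parallel duplicate stays in place.
        remaining = links[:index] + links[index + 1:]
        if not is_connected(nodes, remaining):
            critical.append(link)
    return critical


if __name__ == "__main__":
    # Hypothetical topology: dual links between offices, one link to a closet.
    links = [("hq", "branch"), ("hq", "branch"), ("hq", "closet")]
    print(single_points_of_failure(links))
```

In this made-up topology, the doubled connection between the two offices survives the loss of either link, while the single run to the wiring closet is flagged. A check like this only counts links. It cannot tell whether two connections share the same conduit, so diverse routing still has to be verified separately.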
Service-level agreements can be even more protective than that, but usually [additional] requirements like that are from companies that engineer their own solutions.
Lastly, there are hybrids. So for example, a company with many sites will have a high service-level agreement for their medium and large offices, where everything is redundant. But for the smallest of offices -- maybe they have dozens and dozens of offices with just one or two people, maybe sales offices, scattered all around the world -- often you'll see a different service-level agreement for those offices, where if the router dies, those people are just going to work off the Wi-Fi from their local Starbucks until the office can be brought back online.