The Most Common Single Point of Failure in a Data Center

After more than twenty years of working with customers in their data centers, I’ve seen virtually every piece of critical hardware configured in some redundant fashion to protect against component failure. Storage usually has some sort of erasure coding, servers are clustered in a virtualized solution, and multiple switches are set up in a single management fabric, just to name a few.

And yet, all too often, those redundant components all end up plugging into… the same UPS.

Two power supplies for a server? Plugged into the same UPS. Perhaps different PDUs, but they ultimately end at the same UPS. Those redundant switches? Plugged into the same UPS.

I suppose part of the problem is the name itself: uninterruptible. Perhaps that gives a bit of a false sense of security. I acknowledge that most UPS units are configured to switch to bypass, letting current pass through to the devices in the event of (some) UPS failures. Yet a 2016 study concluded that “UPS system failure continues to be the number one cause of unplanned data center outages.”

So much for uninterruptible, I guess.

The place I see this most often is the small remote office: a single (half) rack of equipment in a closet, and a single 8- or 12-outlet UPS for the entire stack of equipment. However, I still see it a lot in small data centers as well. There may be multiple racks, each with a UPS at the bottom, but everything in a given rack connects to the UPS at the bottom of that same rack. Fortunately, if you happen to be in this situation, the simple fix is to unplug one leg of power from a device and reroute the power cable to another UPS in another rack.

What about large data centers? I often see separate PDUs in these cases, and every so often, devices are plugged in such that they are relatively balanced across the three phases. While that is all well and good, many times these PDUs still go back to a single central UPS.

Obviously, there are limits to what one can do to provide redundancy. The power into a building is what it is. The power plant that provides electricity is what it is. Those things are out of the control of an organization’s IT staff. However, what I’m talking about here are the things that are in the control of the IT staff. Got some equipment going into a small closet at a remote site? It’s only a few hundred dollars to get a second UPS (at that point, even a consumer-grade one is probably an improvement). Got a small data center room with a few racks? Get a UPS for the bottom of each rack and cross-connect devices’ power connections. If you have a large data center with many racks, you may already be planning to purchase multiple UPS devices. If so, make sure each PDU down the side of a rack is connected to a different UPS.
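If you keep even a rough inventory of what plugs into what, this check is easy to automate. Below is a minimal sketch in Python, assuming you can produce two simple mappings: one from each device to the PDUs its power supplies plug into, and one from each PDU to its upstream UPS. The device, PDU, and UPS names are made up for illustration, and single_ups_devices is a hypothetical helper, not part of any vendor tool.

```python
# Hypothetical inventory: device -> PDUs its power supplies plug into.
device_feeds = {
    "server-01": ["pdu-a", "pdu-b"],
    "switch-01": ["pdu-a", "pdu-a"],
    "storage-01": ["pdu-b", "pdu-c"],
}

# Hypothetical mapping: PDU -> the UPS that feeds it.
pdu_to_ups = {
    "pdu-a": "ups-1",
    "pdu-b": "ups-1",
    "pdu-c": "ups-2",
}


def single_ups_devices(device_feeds, pdu_to_ups):
    """Return {device: ups} for devices whose power feeds all end at one UPS."""
    at_risk = {}
    for device, pdus in device_feeds.items():
        # Collapse the device's feeds down to the set of upstream UPS units.
        upstream = {pdu_to_ups[pdu] for pdu in pdus}
        if len(upstream) == 1:
            at_risk[device] = next(iter(upstream))
    return at_risk


if __name__ == "__main__":
    for device, ups in sorted(single_ups_devices(device_feeds, pdu_to_ups).items()):
        print(f"{device}: every power feed ends at {ups}")
```

Note that server-01 is flagged even though it uses two different PDUs, because both PDUs trace back to the same UPS. That is exactly the situation described above.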

We like to think that an outage is caused by a single server or device failing. However, the impact of that kind of failure is often limited to just one, or maybe a few, applications. Annoying? Yes, but that can often be recovered from relatively quickly with limited business impact.

The ugly truth is that a UPS failure can (and does) cause far larger and more serious outages. So, sleep soundly knowing that you have a UPS that has your back, but the next time you visit your data center, take note of how many devices are relying on a single UPS. You might be surprised.
