I find a great deal of misperception around how systems respond when something fails. We all understand the self-evident term “single point of failure.” A “high availability” storage system makes every element redundant so that data stays available in the event of a failure. Almost everyone gets this, but there’s more to the story.
Let’s say you are in a two-engine plane and one engine fails. Are you safe? At the risk of sounding like a lawyer, the answer is: it depends. A two-engine plane can sometimes land safely on one engine, but not always. If the plane is loaded to or beyond its maximum gross weight, it may not have much time in the air once an engine fails. The load on the remaining engine may be too great for the plane to stay aloft long enough to avoid dire consequences.
There are thousands of examples of this all around each of us in this technologically sophisticated world. Will the 3 surviving tires on your grape-hauling trailer work if the 4th one blows out? Well, yes, if each of the four wasn’t loaded to the maximum load limit. If you picked up 2 tons of Cabernet, and the farmer gave you a bit extra because he likes you and you brought him some good wine, the 3 tires left might not carry the load for long. Maybe less than, say, 14 minutes. Not that this has ever happened to me on Highway 101 between Paso Robles and Salinas, near the Bear Road Cutoff. Not even close.
Recently EMC took it in the shorts in an article by Beth Pariseau. Believe it or not, I think it’s unfair to lay the blame solely on EMC. Oh, I know nobody expects me to rush to EMC’s defense, but hey, we are all wrong sometimes.
As explained by EMC, they had a software bug. So far this doesn’t make them unique. All machines have bugs, unless they don’t have software. If they don’t have software, then a tire blows. It’s the way the world is. Next, the affected controller “panicked” and took itself offline. Just as it should, the other controller picked up the load. Unfortunately, the load it took over, when added to its existing load, was greater than 100% of its capacity. Soon, everyone will be out of the car staring at a trailer full of two tons of grapes wondering what went wrong. Grapes are heavy and it quickly becomes obvious that stuffing them in your pockets won’t work. Time for Plan B.
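For anyone who wants the arithmetic behind the grapes, here is a minimal sketch of the headroom check. It assumes each controller's load is expressed as a percentage of one controller's capacity and that the survivors share a failed unit's load perfectly evenly, which real systems rarely manage. The function names and the sample numbers are mine, made up for illustration; they are not figures from the EMC incident.

```python
def survives_single_failure(loads, capacity=100.0):
    """Return True if the surviving units can absorb the load of one failed unit.

    loads: current load on each unit, as a percentage of one unit's capacity.
    Assumes the total load is spread perfectly evenly across the survivors.
    """
    n = len(loads)
    if n < 2:
        # With a single unit there is nothing left to fail over to.
        return False
    # After one unit drops out, the whole load must fit on n - 1 units.
    return sum(loads) <= capacity * (n - 1)


def max_safe_utilization(n):
    """Highest average utilization per unit that still tolerates one failure."""
    return (n - 1) / n


# Two controllers at 60% and 55%: the survivor would face 115%.
print(survives_single_failure([60, 55]))  # False
# Two controllers at 45% and 50%: the survivor would face 95%.
print(survives_single_failure([45, 50]))  # True
# Two controllers must each stay at or below 50%; four tires at or below 75%.
print(max_safe_utilization(2), max_safe_utilization(4))  # 0.5 0.75
```

The point is that redundancy only buys you protection if you leave the headroom unused. A dual-controller array running both sides above 50% is the trailer with four fully loaded tires.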
Most people are amazed that we even heard about an EMC failure. Several of my industry friends remarked that the cover-up must have gone awry. Nobody likes bad press, especially over something that’s not entirely your fault, and EMC usually does a good job of “handling” a sticky situation.