There are many rumours and perspectives about Google’s data centres touted by data centre operators, executives, media, and analysts. I’d like to clear up a number of myths or inaccuracies about Google’s operations. This is based on my own experience, discussions I’ve had with Google employees, and interaction with well-known experts in the data centre ecosystem.
While, on the surface, what Google is doing looks like it could transfer over as a best practice for your data centre, that's not always the case. Google runs their data centres optimally for their business: to deliver content that generates advertising revenue.
What’s important is that you focus on how to run your enterprise data centre optimally for your business, not Google’s. This comment always receives an “Amen” from data centre professionals when I give talks at industry events, because it isn’t fair to compare apples (Google’s content delivery) and oranges (enterprise applications).
As this article points out, your goals and Google’s are not always aligned. While you focus on availability and reliability, Google prioritises cost control over availability in most cases. Without further delay, here are 5 myths about Google’s data centres.
Myth #1: Google’s business critical applications and advertising systems run in PUE 1.2 content-delivery data centres.
This is probably the biggest myth out there. Google runs two types of IT systems: content delivery and critical business services. Let’s take a look at the goals that define how Google runs these two types of systems.
First is content delivery, a homogeneous system of hardware and software that runs MapReduce over the Google File System (GFS). This is where all the data for YouTube, Gmail, Google Apps, etc., lives. The content delivery system needs to be mostly available, but Google has provisioned it so that some outages can be masked with redundancy and others can be handled with an apology message.
This homogeneous environment can be run at its limits, because availability is not the #1 requirement. The content delivery system is the "cost of goods sold (COGS)", or the cost of Google doing business. Minimising cost maximises profit. These are the very large facilities with very low PUE.
Critical business services include Google's internal systems that keep the company running day-to-day (customer management, HR, etc.) as well as their advertising system, which serves advertisements and collects money. Without these systems, Google as a company doesn't exist.
These systems are heterogeneous, running different software packages across a wide array of hardware inside conventional facilities. Running these systems at their limits could jeopardise the company's ability to conduct business and collect revenue, so availability is paramount. These conventional facilities use best practices and likely have a more moderate PUE, between 1.5 and 1.9. Google doesn't disclose information about these facilities because they don't have a sustained power draw of 5MW or more, so you never hear about them.
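PUE (power usage effectiveness) is total facility power divided by the power delivered to IT equipment, so everything above 1.0 is overhead such as cooling and power conversion. The sketch below, assuming a hypothetical 1,000 kW IT load, shows how much overhead power the PUE figures above imply:

```python
def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Non-IT power (cooling, power conversion, lighting) implied by a PUE.

    PUE = total facility power / IT power, so overhead = IT load * (PUE - 1).
    """
    if it_load_kw <= 0 or pue < 1.0:
        raise ValueError("IT load must be positive and PUE cannot be below 1.0")
    return it_load_kw * (pue - 1.0)


# Hypothetical 1,000 kW IT load, compared across the PUE values discussed.
IT_LOAD_KW = 1000.0
for pue in (1.2, 1.5, 1.9):
    print(f"PUE {pue}: {overhead_kw(IT_LOAD_KW, pue):.0f} kW of overhead")
```

For the same IT load, a facility at PUE 1.9 spends about 4.5 times as much power on overhead (900 kW versus 200 kW) as one at PUE 1.2.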
Myth #2: Google uses PUE as their primary metric to manage their data centres.
While PUE is an important metric to Google, it is one metric in a family of metrics that lead to the lowest cost of content delivery. Engineers at Google tell me that, for each of their "business units" (such as YouTube, Gmail, etc.), they evaluate the profit per unit of content. Think of it as comparing the revenue generated with the cost of delivering the content that generates that revenue. I applaud Google for this metric, but wish that they would publicly acknowledge that it is really how they manage their IT infrastructure.
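Google's actual formula is not public, so the following is only a hypothetical illustration of the idea: revenue minus delivery cost, divided by the units of content served. The `BusinessUnit` class, its fields, and all figures are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class BusinessUnit:
    """Hypothetical per-period figures for one content business unit."""
    name: str
    revenue: float        # advertising revenue attributed to the unit
    delivery_cost: float  # IT, facility and energy cost of serving the content
    units_served: int     # units of content delivered (e.g. page views)

    def profit_per_unit(self) -> float:
        """(revenue - delivery cost) / units of content served."""
        if self.units_served <= 0:
            raise ValueError("units_served must be positive")
        return (self.revenue - self.delivery_cost) / self.units_served


# Illustrative numbers only; they are not Google's.
video = BusinessUnit("video", revenue=120_000.0, delivery_cost=90_000.0, units_served=1_000_000)
mail = BusinessUnit("mail", revenue=50_000.0, delivery_cost=20_000.0, units_served=2_000_000)
for unit in (video, mail):
    print(f"{unit.name}: {unit.profit_per_unit():.4f} per unit")
```

Because facility overhead is part of `delivery_cost`, anything that lowers it, including a lower PUE, improves this metric without PUE being the target itself.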
Think about it this way: if you are constantly evaluating business metrics (not technology or infrastructure metrics), then the range of possible ways to increase business value is larger. Changing the way you do things is not limited to a particular technology or infrastructure; instead, you can redesign the software (à la MapReduce and GFS), you can redesign the hardware (à la single DC voltage and backup batteries), and you can redesign the facility (à la containerisation).
All of this work is in the interest of lowering costs and increasing revenue per unit of content. Oh, and by the way, when you make all of these changes, your PUE goes down too. Why? Because the last thing you want to do is spend money on overhead in the cost of goods sold. You're paid for the IT output, so optimising the business metrics naturally drives you toward higher profit.
Myth #3: Google uses renewable energy to power their data centres.
While Google does use some renewable energy at its facilities, these sources are not currently a meaningful source of power for Google's data centres. Even the most progressive solar designs (at Emerson, not Google) provide a paltry 16% of the data centre's power. And solar has the added problem that there's no power when the sun goes down.
When Bloom Energy revealed the Bloom Box, they noted that Google had been testing the system for 18 months. The test was at Google's Mountain View headquarters, and they found the Bloom Box to be 98% reliable (available). While this is a great step forward for fuel cells in scalability and reliability, one nine of reliability simply isn't sufficient to power any data centre. Many journalists, when they found out that Google was a customer, immediately jumped to the conclusion that Google must be using it in their data centres. No, not true, as Data Center Knowledge quickly pointed out.
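To see why one nine falls short, convert availability into expected downtime. This sketch assumes a 365-day year and defines "nines" as the number of leading nines in the availability figure:

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; ignores leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def count_nines(availability: float) -> int:
    """Number of leading nines, e.g. 0.98 -> 1 and 0.999 -> 3."""
    if not 0.0 <= availability < 1.0:
        raise ValueError("availability must be in [0, 1)")
    # Small epsilon guards against floating-point error at exact nines.
    return math.floor(-math.log10(1.0 - availability) + 1e-9)


for a in (0.98, 0.999, 0.99999):
    hours = downtime_hours_per_year(a)
    print(f"{a}: {count_nines(a)} nine(s), {hours:.2f} hours down per year")
```

At 98% availability the expected downtime is about 175 hours a year, more than a week, which no production data centre could tolerate.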
Myth #4: Google’s battery-on-server technique provides a more robust power backup solution.
Google’s server design for their content delivery data centres includes a full 12V system (no 3V or 5V components) with lead-acid battery backup (instead of a central UPS). The battery is said to power the system “for a few minutes” during an outage, after which the backup generators should be running and supplying power. Google said at their Data Centre Efficiency Summit, “if the generators don’t kick in within a few minutes, you have bigger problems and better have a fail over strategy.”
Generally this is true; if your generators don’t kick in within a few minutes, you are going to have bigger problems. That’s why it is important to test them regularly, and familiarise yourself with their operation. Continually evaluate whether the generators are appropriately sized for today’s IT load.
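A rough first check on generator sizing is to compare generator capacity with total facility load, which by the definition of PUE is the IT load multiplied by PUE. The 20% minimum headroom, the 2,000 kW generator rating, the PUE of 1.6, and the IT loads below are all hypothetical, and a real sizing study also has to account for factors this sketch ignores.

```python
def generator_headroom(generator_kw: float, it_load_kw: float, pue: float) -> float:
    """Fraction of generator capacity left after carrying the whole facility.

    Assumes the generators carry total facility power (IT load * PUE).
    A negative result means the generators are undersized.
    """
    if generator_kw <= 0 or it_load_kw <= 0 or pue < 1.0:
        raise ValueError("inputs must be positive and PUE must be >= 1.0")
    facility_kw = it_load_kw * pue
    return (generator_kw - facility_kw) / generator_kw


MIN_HEADROOM = 0.20  # hypothetical policy: keep at least 20% spare capacity

# Hypothetical site: 2,000 kW of generators and an IT load that has grown.
for it_kw in (800.0, 1_100.0, 1_300.0):
    headroom = generator_headroom(2_000.0, it_kw, pue=1.6)
    status = "OK" if headroom >= MIN_HEADROOM else "REVIEW SIZING"
    print(f"IT {it_kw:.0f} kW: headroom {headroom:.0%} -> {status}")
```

Re-running a check like this whenever the IT load changes is one way to act on the advice above.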
This gets back to availability versus efficiency; Google again chooses cost efficiency over availability, and the system-wide design of their homogeneous software architecture enables this battery design decision. Conventional UPS systems can power a data centre for an hour or more, and battery systems can be extended centrally to provide more runtime.
The battery-on-server system cannot be extended without replacing batteries on every piece of equipment or waiting for a refresh cycle. It does, however, provide a distributed battery backup that eliminates the single point of failure (central UPS) in conventional designs.
The batteries used in the Google design are 3.4Ah, 12V sealed lead-acid units. At a load of roughly 350W (about 29A at 12V), the battery's voltage and charge drop below a usable level after 6 to 12 minutes. Note that Google has to use the 3.4Ah battery rather than a higher-capacity one, because higher-capacity batteries are too large to fit in a 2U physical configuration. The 3.4Ah battery is 2.36″ high, plus wiring and terminals, and thus fits nicely within the 3.5″ height of 2U.
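A quick sanity check on those runtimes: the stored energy of a 3.4Ah, 12V battery is about 40.8Wh, and dividing it by the load gives an idealised upper bound on runtime. The 200W and 350W loads below are illustrative assumptions, and the sketch deliberately ignores the Peukert effect, conversion losses and cut-off voltage, all of which shorten real runtime at high discharge currents.

```python
def ideal_runtime_minutes(capacity_ah: float, voltage_v: float, load_w: float) -> float:
    """Idealised runtime: stored energy (Wh) divided by load (W), in minutes.

    Ignores the Peukert effect, conversion losses and the cut-off voltage,
    so real runtime at high discharge currents will be shorter.
    """
    if min(capacity_ah, voltage_v, load_w) <= 0:
        raise ValueError("all inputs must be positive")
    energy_wh = capacity_ah * voltage_v
    return energy_wh / load_w * 60.0


CAPACITY_AH = 3.4  # battery capacity from the text
VOLTAGE_V = 12.0   # nominal battery voltage from the text

for load_w in (200.0, 350.0):  # hypothetical server loads
    current_a = load_w / VOLTAGE_V
    minutes = ideal_runtime_minutes(CAPACITY_AH, VOLTAGE_V, load_w)
    print(f"{load_w:.0f} W (~{current_a:.1f} A): at most ~{minutes:.1f} min")
```

At 350W the ideal figure is about 7 minutes, and it takes a load nearer 200W to reach about 12 minutes. By contrast, a 3.4A draw at 12V is only about 41W.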
Myth #5: You should be held to the same standard as Google when running your data centre.
Let's face it: Google's content delivery data centres run a single application across a homogeneous physical infrastructure. While this is far more feasible with new builds, existing data centres have such a wide array of equipment that these kinds of industrial-scale efficiency techniques are impractical. Furthermore, your data centre runs ERP, CRM, HR, transactional, and Web applications, to name a few. These applications have varied architectures, as well as varied service, availability, and performance requirements.
Achieving the same level of efficiency as Google in your data centre is a noble goal, but ultimately you need to get the best performance for your data centre. This means metrics that map to the business needs that your data centre fulfils. Just as Google uses profit per unit of content served, you must identify the right guiding metrics to run a lean, mean operation.
While Google's content delivery data centres perform very well at the task they were built for, they are not apples-to-apples comparable to a business-critical enterprise operation. Manage your team and communicate to executives the metrics that make sense, because the last thing you want to do is get into a debate around "my PUE is better than yours" and "why don't you have the same PUE as Google" when the service you're providing is so vastly different from the one provided by Google.
There are more myths than just these 5, of course. Let’s start a dialogue about how to best run an enterprise data centre, not an industrial content delivery system, and develop best practices to optimise for the enterprise.