To control something, you must first be able to measure it. This is one of the most basic principles of engineering. Once there is measurement, there can be feedback. Feedback creates a virtuous loop in which the output changes to better track the changing input demand.
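As a minimal sketch of that loop, assuming a simple proportional controller (a generic textbook technique, not something this text prescribes), the function below measures the gap between demand and output at each step and corrects a fixed fraction of it. The name track_demand, the gain of 0.5 and the demand series are all hypothetical.

```python
def track_demand(demand, gain=0.5, start=0.0):
    """Proportional feedback: at each step, measure the error and correct part of it.

    Returns the output value after every step. The default gain is hypothetical.
    """
    output = start
    history = []
    for target in demand:
        error = target - output   # measurement: how far the output is from demand
        output += gain * error    # feedback: move the output towards demand
        history.append(output)
    return history


# Hypothetical demand that steps from 10 to 20 units halfway through.
demand = [10] * 5 + [20] * 5
for step, value in enumerate(track_demand(demand)):
    print(step, round(value, 2))
```

Without the measured error there is nothing to correct; with it, the output settles towards each new level of demand within a few steps.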
Improving data centre efficiency is no different. If efficiency means meeting the organisation's demands for lower energy consumption, better utilisation of assets and faster response to change requests, then the very first step is to measure those things and to use the measurements to provide feedback, and thereby control.
So what do we want to control? We can divide it into three areas: the data centre facility, the use of compute capacity, and the communications between the data centre and the outside world. The relative importance of these will differ from one organisation to another.
Data centres come in all sorts, ranging from professional colocation facilities to the server cupboard under the stairs found in some smaller enterprises. Professional data centre operators focus hard on the energy efficiency of the whole facility. The most common measure of facility energy efficiency is PUE (power usage effectiveness), originally defined by The Green Grid.
This is simple: PUE is the total energy entering the facility divided by the energy used to power the electronic (IT) equipment. It is often abused; one example is the data centre that powered its facility lighting over PoE (Power over Ethernet), thereby counting the lighting as 'electronic equipment' and making its PUE look better than it was. Even so, PUE is widely understood and used worldwide, and it provides visibility and focus for the process of continuous improvement. It is easy to measure at the facility level, as it needs only monitors on the mains feeds into the building and on the UPS outputs.
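A minimal sketch of the PUE arithmetic, assuming hypothetical monthly readings: the numerator comes from the mains-feed monitors and the denominator from the UPS-output monitors. The function name pue and the figures are illustrative only. The second call shows how moving 5,000 kWh of lighting behind the UPS (as with lighting powered over Ethernet) lowers the ratio without saving any energy.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical monthly readings.
mains_kwh = 150_000       # from monitors on the mains feeds into the building
ups_output_kwh = 100_000  # from monitors on the UPS outputs

print(round(pue(mains_kwh, ups_output_kwh), 2))  # 1.5

# Reclassifying 5,000 kWh of lighting as IT load flatters the ratio,
# even though total consumption is unchanged.
print(round(pue(mains_kwh, ups_output_kwh + 5_000), 2))  # 1.43
```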
Power efficiency can be managed at several levels: the facility, the cabinet and 'useful work'. The last is difficult to define, let alone measure, and various working groups around the world are trying to decide what 'useful work' means. It may be compute cycles per kW, revenue generated within the organisation per kW or application run time per kW, and it may differ between organisations. Whatever it is, it has to be properly defined and measured before it can be controlled.
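The sketch below, using invented figures and a hypothetical Period record, shows why the definition matters: the same month of measurements yields three quite different 'useful work per kW' figures depending on which proxy is chosen. It divides by average IT power over the period (energy divided by hours), which is one assumption about what 'per kW' should mean.

```python
from dataclasses import dataclass


@dataclass
class Period:
    """Hypothetical measurements for one reporting period."""
    hours: float            # length of the period
    it_energy_kwh: float    # energy drawn by the IT equipment
    compute_cycles: float   # candidate 1: compute cycles delivered
    revenue: float          # candidate 2: revenue attributed to IT services
    app_run_hours: float    # candidate 3: total application run time


def useful_work_per_kw(p):
    """Express each candidate measure of 'useful work' per kW of average IT power."""
    avg_kw = p.it_energy_kwh / p.hours
    return {
        "cycles_per_kw": p.compute_cycles / avg_kw,
        "revenue_per_kw": p.revenue / avg_kw,
        "app_hours_per_kw": p.app_run_hours / avg_kw,
    }


month = Period(hours=720, it_energy_kwh=72_000, compute_cycles=3.6e15,
               revenue=250_000, app_run_hours=36_000)
print(useful_work_per_kw(month))
```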
DCIM (data centre infrastructure management) systems provide a way to measure the population and activity of servers, and particularly of virtual machines. In large organisations, with potentially many thousands of servers, DCIM also provides a means of physical inventory tracking and control. More important than the question “how many servers do I have?” is “how much useful work do they do?” Typically a large data centre will have around 10% ghost servers: servers that are powered and running but do nothing useful. Finding those alone can justify the cost of DCIM and the effort needed to set it up.
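As an illustration of how measured activity can flag ghost-server candidates, here is a sketch assuming the DCIM or monitoring system can export per-server power state, average CPU utilisation, network traffic and login counts. The ServerStats fields and the thresholds are hypothetical. Background agents keep even idle servers slightly busy, so the result is a list for human review rather than a decommissioning order.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Hypothetical per-server record exported from a DCIM or monitoring tool."""
    name: str
    powered_on: bool
    avg_cpu_pct: float      # mean CPU utilisation over the observation window
    net_kb_per_day: float   # mean network traffic per day
    logins_90d: int         # interactive logins in the last 90 days


# Hypothetical thresholds; suitable values depend on the estate.
CPU_IDLE_PCT = 3.0
NET_IDLE_KB_PER_DAY = 500.0


def ghost_candidates(servers):
    """Return names of powered-on servers showing no sign of useful activity."""
    return [
        s.name
        for s in servers
        if s.powered_on
        and s.avg_cpu_pct < CPU_IDLE_PCT
        and s.net_kb_per_day < NET_IDLE_KB_PER_DAY
        and s.logins_90d == 0
    ]


fleet = [
    ServerStats("web01", True, 35.0, 2_000_000.0, 40),
    ServerStats("old-batch", True, 0.8, 120.0, 0),
    ServerStats("spare-07", False, 0.0, 0.0, 0),  # powered off, so not a ghost
]
print(ghost_candidates(fleet))  # ['old-batch']
```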
Virtualisation brings its own challenges. It has taken us away from the days when a typical server ran at 10-15% utilisation, but most data centres are still a long way from using virtualisation efficiently. Users often over-specify server capacity for an application, requesting more CPUs, memory and storage than it really needs, just to be on the safe side and because they can.
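A minimal right-sizing sketch, assuming peak vCPU usage has been measured over a representative period: it replaces stacked safety margins with a single explicit headroom (a hypothetical 25%) and never suggests more than was allocated. The function name and defaults are illustrative, not taken from any particular virtualisation platform.

```python
import math


def right_size_vcpus(allocated, peak_used, headroom=0.25):
    """Suggest a vCPU count: measured peak plus one explicit headroom margin.

    The 25% default headroom is a hypothetical policy value.
    """
    suggested = max(1, math.ceil(peak_used * (1 + headroom)))
    return min(allocated, suggested)


# A VM allocated 16 vCPUs whose measured peak was 3 vCPUs.
allocated = 16
suggested = right_size_vcpus(allocated, peak_used=3.0)
print(suggested, "vCPUs suggested;", allocated - suggested, "stranded")  # 4 vCPUs suggested; 12 stranded
```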
Users see the data centre as a sunk cost: 'it's already there and paid for, so we might as well use it'. This creates 'VM sprawl'. The way out is to measure, quote and charge. If users are charged for the machine time they use, they will think more carefully about wasting it, and about piling contingency allowance upon contingency allowance 'just in case', which leaves capacity stranded and unused. And if they are given a real-time quote for the costs before committing to them, they will think harder about how much capacity they really need.
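The 'quote' step can be sketched as a simple function that prices a request before it is provisioned. The rates, the 730-hour month and the Request fields are all hypothetical; a real scheme would derive its rates from measured facility and hardware costs. Comparing a modest request with a padded one makes the cost of 'just in case' visible at the point of decision.

```python
from dataclasses import dataclass

# Hypothetical internal rates; a real scheme would derive them from measured costs.
RATE_PER_VCPU_HOUR = 0.02
RATE_PER_GB_RAM_HOUR = 0.005
RATE_PER_GB_STORAGE_MONTH = 0.04
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


@dataclass
class Request:
    """Hypothetical capacity request submitted by a user."""
    vcpus: int
    ram_gb: int
    storage_gb: int


def monthly_quote(req):
    """Price a request per month, before it is provisioned."""
    compute = req.vcpus * RATE_PER_VCPU_HOUR * HOURS_PER_MONTH
    memory = req.ram_gb * RATE_PER_GB_RAM_HOUR * HOURS_PER_MONTH
    storage = req.storage_gb * RATE_PER_GB_STORAGE_MONTH
    return round(compute + memory + storage, 2)


modest = Request(vcpus=4, ram_gb=16, storage_gb=200)
padded = Request(vcpus=16, ram_gb=64, storage_gb=1000)  # contingency upon contingency
print(monthly_quote(modest), monthly_quote(padded))  # 124.8 507.2
```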
Data centres do not exist in isolation. Every data centre is connected to other data centres, and often to multiple external premises such as retail shops or oil rigs. Those links often have little redundancy and may well not operate efficiently. Again, to optimise the efficiency and reliability of these networks, the first requirement is to be able to measure what they are doing. That means having a separate monitoring mechanism at each remote point, reporting back to a central point over a different communications network from the one being monitored. The mobile phone network often performs that role.
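A sketch of the central end of that arrangement, assuming each remote site has an independent probe that reports both a heartbeat and the state of the site's primary link over a separate network such as a cellular connection. The SiteMonitor class, the site names and the five-minute timeout are hypothetical. Keeping the two signals apart lets the central point tell "the primary link is down" (reported by a live probe) from "we have heard nothing at all".

```python
import time

HEARTBEAT_TIMEOUT_S = 300  # hypothetical: five minutes without a report


class SiteMonitor:
    """Central collector for reports from hypothetical out-of-band site probes."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # site -> time of last report
        self.link_up = {}    # site -> primary link state in that report

    def record(self, site, link_up, now=None):
        """Store a report that arrived over the separate (e.g. cellular) network."""
        self.last_seen[site] = time.time() if now is None else now
        self.link_up[site] = link_up

    def status(self, now=None):
        """Classify every known site from its most recent report."""
        now = time.time() if now is None else now
        result = {}
        for site, seen in self.last_seen.items():
            if now - seen > self.timeout_s:
                result[site] = "probe silent"
            elif not self.link_up[site]:
                result[site] = "primary link down"
            else:
                result[site] = "ok"
        return result


monitor = SiteMonitor()
monitor.record("shop-12", link_up=True, now=1000)
monitor.record("rig-3", link_up=False, now=1000)
monitor.record("shop-40", link_up=True, now=500)
print(monitor.status(now=1000))
```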
Measurement is the core of all control and efficiency improvement in the modern data centre. If the organisation demands improved efficiency (and if it can define what that means) then the first step to achieving it is measurement of the present state of whatever it is we are trying to improve. From measurement comes feedback. From feedback comes improvement and from improvement comes control. From control comes efficiency, which is what we are all trying to achieve.