Last week I met with Dave, who used to run systems for one of the world’s largest and most successful financial firms. They have a $1bn+ IT budget and a huge amount of it is spent on storage. When Dave approached the whiteboard things got really interesting. In 4-5 quick squeaks of a pen he outlined the tidal wave in data growth that is only just beginning. And he did so in a way that really appealed to the former economist geek in me. Sometimes a good graph really tells the story.

So here it goes.

First the axes. For this graph we want time on the x axis and value on the y axis. We’re going to look at the relationship between value and costs at various moments in time.

And then let’s draw our first line, the value of an average piece of data over time. Data itself is only valuable as information. That’s an obvious point but often overlooked. So when a bit is written to disk it is just data but pretty quickly, on average, it starts to be used somehow and then, over time, it becomes less valuable. So you can draw this curve as a bit of a bell curve with a long tail (setting aside cases in which for regulatory reasons the data needs to be destroyed because it has negative value).

And then let’s draw our second line, the cost of storing that average piece of data over time. You can draw this as a line that slopes down and to the right. In practice, though, Dave reported being exploited by his legacy storage vendors, who knew that their business models had him locked in. If he fired them he would have to migrate petabytes of data, which was extremely expensive to do. As a result, his average cost per piece of data sometimes INCREASED over time. Nonetheless, over the long run, the cost line still slopes gently down and to the right.

Now you have two curves. It seems pretty simple to fill in the space that is under the value curve and above the cost curve. This graph suggests that this space must be the time during which the data should be kept. After the point where the two lines intersect, the data can be deleted, since the cost of keeping it is greater than its value.
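For those who would rather see the two curves as numbers than as whiteboard squiggles, here is a minimal Python sketch. The curve shapes and every constant in it are my own illustrative assumptions, not Dave's figures: value rises to a peak and then fades with a long tail, and storage cost decays slowly.

```python
import math

# All numbers below are hypothetical, chosen only to give the curves
# the shapes described in the text (units: dollars per year, years).
PEAK_VALUE = 10.0      # value of the data at its most useful
PEAK_TIME = 2.0        # years until the data is most useful
START_COST = 1.0       # storage cost at the moment of writing
COST_DECLINE = 0.1     # yearly rate at which storage cost decays


def value(t):
    """Value rises to PEAK_VALUE at PEAK_TIME, then fades with a long tail."""
    x = t / PEAK_TIME
    return PEAK_VALUE * x * math.exp(1.0 - x)


def storage_cost(t):
    """Storage cost slopes gently down and to the right."""
    return START_COST * math.exp(-COST_DECLINE * t)


def keep_until(step=0.01, horizon=50.0):
    """Scan forward from the peak for the first time value drops below cost."""
    t = PEAK_TIME
    while t < horizon:
        if value(t) < storage_cost(t):
            return t
        t += step
    return None  # the curves never cross within the horizon


crossover = keep_until()
print(f"Value falls below storage cost after about {crossover:.1f} years")
```

With these made-up inputs the naive rule says to keep the data until the crossover and delete it afterwards. Change the constants and the crossover moves, but the logic stays the same.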

But now draw a line starting at $0 and increasing at some reasonable rate to the right. This is where it gets really interesting. It turns out that there is a linear relationship between time and the cost of deleting data. Over time it becomes more difficult to ascertain which data can be safely deleted. Dependencies pop up and the cost of deleting the wrong data – that which supplies the critical information necessary to back date financial algorithms for example – can be exorbitant.

When you add the cost of deletion line to the graph, depending on where you draw it of course, you see that it may not be worth it to delete data at all. And if you cannot delete the data and you are creating more and more of it at an ever increasing rate, then you have a data explosion. Here is the cost added to the value curve. As you can see, even with their then-current cost of storage, they should never delete data.
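Here is a rough Python sketch of what adding the deletion line does. As on the whiteboard, the deletion cost is simply stacked on top of the value curve and the sum is compared with the cost of storage. The curve shapes, the constants, and the function names are all hypothetical choices of mine for illustration.

```python
import math


def value(t, peak=10.0, peak_time=2.0):
    """Hypothetical value curve: peaks at peak_time, then a long tail."""
    x = t / peak_time
    return peak * x * math.exp(1.0 - x)


def storage_cost(t, start=1.0, decline=0.1):
    """Hypothetical storage cost: decays slowly over time."""
    return start * math.exp(-decline * t)


def deletion_cost(t, rate):
    """Cost of safely deleting the data: starts at $0 and grows linearly."""
    return rate * t


def first_time_worth_deleting(rate, step=0.01, horizon=50.0, peak_time=2.0):
    """Return the first time after the peak at which deleting pays off."""
    t = peak_time
    while t < horizon:
        # Keeping wins as long as value plus the avoided deletion cost
        # still exceeds what storage costs.
        if value(t) + deletion_cost(t, rate) < storage_cost(t):
            return t
        t += step
    return None  # deleting never pays off within the horizon


for rate in (0.0, 0.1):
    print(f"deletion cost rate {rate}: {first_time_worth_deleting(rate)}")
```

With a deletion cost of zero this reduces to the plain value-versus-cost comparison and a deletion date appears. With the (assumed) rate of 0.1 the function returns `None`: under these inputs there is no point at which deleting is the cheaper choice.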

Dave tells the story of a 3+ month project at his firm to delete extra data. They trained their teams and thoroughly and carefully reviewed their data. They were able to delete some 5% of the data they stored, which turned out to be almost 1PB of data, worth millions of dollars at legacy storage costs. They were quite happy about these savings until they realized that they had spent the time of 15 people for approximately 4 months, or 5 man-years, to delete the data. They had spent nearly as much to delete the data as they had saved in storage costs, and that was only in direct costs.
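The back-of-the-envelope math is easy to reproduce. In the Python sketch below, the head count and duration come from Dave's story; the two dollar amounts are placeholders I made up purely to show the calculation, since the actual figures were not shared.

```python
def person_years(people, months):
    """Convert a team's time on a project into person-years."""
    return people * months / 12.0


def net_savings(storage_savings, people, months, cost_per_person_year):
    """Storage savings minus the direct labor cost of the deletion project."""
    labor_cost = person_years(people, months) * cost_per_person_year
    return storage_savings - labor_cost


# From the story: 15 people for approximately 4 months.
effort = person_years(15, 4)

# Hypothetical dollar figures, for illustration only.
example = net_savings(
    storage_savings=2_000_000,     # assumed value of the storage freed
    people=15,
    months=4,
    cost_per_person_year=350_000,  # assumed fully loaded cost per person
)

print(f"Effort: {effort} person-years, net savings: ${example:,.0f}")
```

Plug in your own storage savings and staff costs. The net shrinks quickly once the labor is counted, and this covers only direct costs.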

And now, one more line to draw. Draw in a cost line at 25% of the cost of legacy storage and see what happens to your data usage. Those of you with a good eye, or who remember your calculus, will immediately get the answer: the amount of data stored explodes, so much so that the total amount spent on data storage will increase.

It turns out that data, being the raw material of information, wants to be stored by just about everyone for all sorts of reasons. If you bring the cost of storage incrementally closer to zero, you find that demand increases exponentially. The end result is more storage spending, because more valuable data is being stored on enterprise class storage in the first place and is never deleted. As a result the company will become yet more intelligent and successful.
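For the economists in the audience, one simple way to see why cheaper storage can raise total spending is a constant-elasticity demand curve. That model is my assumption, not something Dave drew, and the elasticity values in the Python sketch below are hypothetical. The point is only that when demand is elastic enough (elasticity above 1), cutting the price to 25% makes the total bill go up.

```python
def demand(price, elasticity, scale=1.0):
    """Constant-elasticity demand: quantity = scale * price^(-elasticity)."""
    return scale * price ** (-elasticity)


def total_spend(price, elasticity, scale=1.0):
    """Total storage bill: price per unit times quantity stored."""
    return price * demand(price, elasticity, scale)


def spend_ratio(new_price, old_price, elasticity):
    """How total spend changes when the unit price changes."""
    return total_spend(new_price, elasticity) / total_spend(old_price, elasticity)


# Price drops to 25% of the legacy price (legacy price normalized to 1.0).
for elasticity in (0.5, 1.5):  # hypothetical elasticities
    ratio = spend_ratio(0.25, 1.0, elasticity)
    print(f"elasticity {elasticity}: total spend changes by a factor of {ratio:.2f}")
```

Under this assumed model, an elasticity below 1 means the bill shrinks when prices fall, while an elasticity above 1 means the amount stored grows faster than the price drops, so the bill grows.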

So the corollary of Dave’s Data Deletion Dilemma is the myth of storage savings. The less you spend on storage per capacity, the more you will spend on storage – because enterprises’ demand to record and analyze information is nearly infinite. And the more you store the more sense you can make of the world.