Ever since the birth of the ‘big data’ boom, the consensus has been that size is good, and bigger is better. In one sense, this reasoning is fair enough: after all, ‘big data’ is predicated upon the ability to analyse the huge volumes of information that organisations now produce. But the seductive claims made by big data – better insight, real-time analysis, more accurate predictions – hide the fact that the data deluge is as likely to make an enterprise slower, less responsive and, in the long term, less ‘intelligent’.
The problem is volume. Businesses cannot process the sheer amount of data that comes flooding in from myriad sources. All this information takes time to process; often, so much time that the latest data is not available to analyse. The result is that organisations are forced to work with old, out-of-date data. As the flow of information inexorably increases, so the information becomes even more elderly and less relevant.
The much vaunted real-time insight provided by big data analytics is, in fact, anything but. I have seen customers whose supply chain reporting was being delayed by a fortnight or more, simply because they had so much data to process.
A two-week wait for data is a long way from the bold future of instant insight that was predicted. The reason for these delays lies in several bottlenecks that complicate and slow the process of dealing with the data.
One of the most common problems is getting hold of the information in the first place. It’s easy to think of ‘big data’ as feeding directly into a single giant repository, but in reality this valuable information is generated and resides in any number of business applications.
Before it can be analysed it needs to be pulled from the enterprise apps (which can number in the thousands for large organisations) and directed to operational data stores. This migration can only be completed in a limited time window, and it is when volumes exceed this window that data begins rapidly to get out of date.
A second related problem is replication – specifically, of databases. Most if not all large organisations are guilty of creating and managing multiple instances of single databases. In fairness, this is hardly their fault: databases are used for a multitude of business processes, by no means limited to test and development, quality assurance (QA), training, back-up and disaster recovery.
This means that, on average, each database is replicated around eight to ten times. Nor are we talking about a few rogue spreadsheets – these databases can often be in the terabyte range.
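The storage cost of that replication is easy to underestimate. A back-of-envelope calculation, using hypothetical but representative figures, makes the point:

```python
# Illustrative figures only: the physical footprint of replicating
# a single production database eight to ten times.
prod_db_tb = 2.0   # size of one production database, in terabytes (hypothetical)
copies = 9         # test, dev, QA, training, back-up, DR... (8-10 is typical)

physical_footprint_tb = prod_db_tb * (1 + copies)  # the original plus its replicas
print(physical_footprint_tb)  # 20.0 TB of storage for what is logically 2 TB
```

Every one of those terabytes also has to be moved, refreshed and crunched through by downstream systems, which is where the real delay accumulates.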
Useless terabytes act like a sea anchor on any business intelligence system – it takes huge amounts of time and effort to crunch through the replicated data, producing a drag on the whole process. The inevitable delays caused by database replication only add to the problem of ageing data.
The final issue is with data masking. To protect sensitive information within a database, such as personally identifiable data, organisations need to ensure that this information is obfuscated when the database is being used for testing, QA or analysis. Again, this is not too difficult a task – until database replication forces organisations to complete the task multiple times.
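One common masking technique is deterministic pseudonymisation: sensitive fields are replaced with irreversible tokens before the data reaches test or QA environments. A minimal sketch, with illustrative field names and a hypothetical salt:

```python
import hashlib

def mask_record(record, sensitive_fields=("name", "email", "ssn"), salt="example-salt"):
    """Replace sensitive fields with a deterministic, irreversible pseudonym."""
    masked = dict(record)
    for field in sensitive_fields:
        if field in masked:
            digest = hashlib.sha256((salt + str(masked[field])).encode()).hexdigest()
            masked[field] = digest[:12]  # short token; original value is unrecoverable
    return masked

print(mask_record({"name": "Jane Doe", "email": "jane@example.com", "order_total": 42.50}))
```

Because the pseudonym is deterministic, the same name always maps to the same token, so joins across masked tables still line up. The pain point the article describes is that this whole pass must be re-run for every physical replica of the database.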
It should be obvious how these bottlenecks add significant time to any project, making business intelligence/analysis far more sluggish and less responsive. The damage done by old data does not stop there, however.
Take, for example, the process of product development, whether it’s a jet aircraft or a new piece of software. Lengthier data processing times have ramifications all the way down the project lifecycle, leading managers to make an invidious choice: delay the product release date, or accept a much higher rate of error in the data on which the product design is based.
This pressure to avoid delays in project development is one of the main reasons why businesses end up using older, out-of-date data, with all the risks of error that this entails. In situations such as this, data is as much a liability as an asset.
Until very recently, the only way to combat these problems was for organisations to consciously limit their data sets by working with a smaller, more manageable subset of data. Alternatively, they could be brutal about the types of data chosen for real-time reporting. Neither of these approaches is in the spirit of ‘big data’, the ethos of which is that all business information should bring value to an organisation.
Fortunately, a new technology has been developed that goes right to the nub of this problem; it harnesses the power of virtualisation and applies it to databases. Database virtualisation gives organisations the ability to create multiple copies of the same database – each with its own additions and amendments, and varying degrees of data ‘freshness’ and accuracy. The technology works by maintaining a single physical copy of each database or dataset and presenting users with a lightweight virtual instance every time one is needed.
This means that, almost overnight, organisations can eliminate the need for physical database replication, slashing the amount of storage space required and, crucially, massively reducing the processing burden on data analysis systems.
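The idea underpinning this is copy-on-write: every virtual copy shares the blocks of one physical image, and only the blocks a user actually changes consume new storage. A toy sketch of the principle (real products operate at the storage-block level, not on Python dictionaries):

```python
# Toy illustration of copy-on-write: virtual copies share one physical
# image ("base") and record only their own changes ("delta").
class VirtualCopy:
    def __init__(self, base):
        self.base = base   # shared, read-only physical image
        self.delta = {}    # private amendments for this copy only

    def read(self, key):
        # A copy sees its own change if it made one, otherwise the shared data.
        return self.delta.get(key, self.base.get(key))

    def write(self, key, value):
        self.delta[key] = value  # only the changed block consumes new storage

golden = {"row1": "original", "row2": "original"}
qa = VirtualCopy(golden)        # QA's "copy" costs almost nothing to create
training = VirtualCopy(golden)  # so does training's
qa.write("row1", "amended for QA")
print(qa.read("row1"), training.read("row1"))  # amended for QA original
```

Each team gets what behaves like a full private database, but the terabytes exist only once – which is why the storage and processing savings arrive so quickly.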
Database virtualisation is still in its infancy, and yet is already bringing astounding results. Organisations that have deployed the technology have seen their processing times shrink from weeks to a few hours. This means that they can begin to achieve the types of insight and business intelligence that ‘big data’ has always promised. With the right technology in place, the big data boast – mine’s bigger than yours – might finally start to acquire some real meaning.