Along with the increasing ubiquity of technology comes an increase in the volume of electronic data. Not many years ago, corporate databases tended to be measured in the range of tens to hundreds of gigabytes.
Now, multi-terabyte (TB) or even petabyte (PB) databases are quite normal. The World Data Center for Climate (WDCC) stores over 6PB of data overall (although all but around 220TB of this is held in archive on tape), and the National Energy Research Scientific Computing Center (NERSC) has over 2.8PB of available data covering atomic energy research, physics projects and more.
Even companies such as Amazon are running databases in the tens of TB. Companies that most would not expect to have to worry about such massive systems are dealing with databases in the hundreds of TB: ChoicePoint, for example, is a US company that few will have heard of, yet it tracks data about the whole of the US population and has one of the largest databases in the world.
Large databases are less surprising at telecoms companies and service providers, where just the log files of all the events happening across such technology estates can easily build up database sizes. Social media sites are another example: even those that are text-only or primarily text (e.g. Twitter, Facebook) have big enough problems, while the likes of YouTube have to deal with massively expanding datasets.
Yet here is the start of the biggest problem – the type of data that has to be dealt with is changing. When it was rows and columns of figures held in a standard database, life was – relatively – simple. It all came down to the speed of the database and the hardware it was running on.
Now, more and more binary large objects (BLOBs) are appearing in databases. These require different approaches to identify and report on what the content actually is, to find patterns in it and to make sense of what this means to the end user.
Even worse is the fact that less information is making it into standard databases. Yes, there’s still an increasing amount of numerical and textual data being created that resides within a database, but this is being outstripped by the amount of information being created in a more ad hoc manner, as files that sit directly in a file system.
At the formal data level, vendors initially used various approaches, such as data warehouses, data marts, data cubes and so on to provide a fast and effective means of analysing and reporting on very large data sets.
When this started to creak at the seams, master data management, data federation and other techniques such as in-memory databases and “sip of the ocean” indicative analysis were brought in to try to keep ahead of the curve. What has become apparent, however, is that such approaches were just stopgaps, and that database vendors have really been struggling to keep up.
To deal with the increasing amount of information being held within databases, an approach termed “big data” has come to the fore. It was originally aimed at companies in markets such as oil and gas exploration and pharmaceuticals, and at others dealing with massive data sets. Big data looked at how to move away from overly complex and relatively slow systems to ones that could provide much greater visibility of what is happening at a data level, enabling those in highly data-centric environments to deal with massive data sets in the fastest time possible.
However, it has evolved into an idea being presented to the commercial world as a means of dealing with their own complex data systems – and also, in some cases, to deal with information being held outside of formal databases themselves.
The Apache Hadoop system is one such approach. This open source framework uses its own distributed file system (HDFS) to create a massively scalable and highly performant platform for dealing with different sorts of data, which can include textual or other data that has been brought into the Hadoop system through, for example, web crawlers or other search engines.
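As a rough illustration of the split-and-combine style of processing commonly associated with Hadoop, the following minimal Python sketch counts words across several blocks of text. It does not use Hadoop itself: the sample blocks and the function names `map_block`, `reduce_counts` and `word_count` are assumptions made purely for illustration, and a real cluster would run the per-block step on many machines in parallel rather than in a single loop.

```python
from collections import Counter
from functools import reduce


def map_block(block: str) -> Counter:
    """Count the words in one block of text (the per-block 'map' step)."""
    return Counter(block.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Merge two partial counts into one (the 'reduce' step)."""
    return left + right


def word_count(blocks: list[str]) -> Counter:
    """Count words over all blocks by mapping each block, then merging."""
    partials = [map_block(block) for block in blocks]
    return reduce(reduce_counts, partials, Counter())


# Hypothetical blocks, standing in for pieces of a much larger file.
blocks = ["big data is big", "unbounded data is better"]
totals = word_count(blocks)
```

Because each block is counted independently, adding more blocks only adds more independent pieces of work, which is the property that lets this style of processing scale out across machines.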
Another approach was demonstrated by IBM with its Watson computer system, made famous by its win in a special edition of the US quiz programme Jeopardy. The Watson system uses a mix of database technology and search systems, along with advanced analytic technologies, to enable a computer to appear to be “thinking” in the same way that a human does, working backwards from a natural language answer to predict what the question associated with that answer would have been.
Watson is now being developed into a range of applications that can be sold commercially. It is not some highly proprietary system built just for one purpose: IBM purposefully designed it on commercially available hardware and software (such as DB2, WebSphere, InfoSphere Streams and so on) to make it useful to the general user in as short a time as possible.
The problem remains that most organisations still regard “data” as rows and columns of numbers that can be mined and reported on using analytical tools that will end up with a graph of some sort. This is why I prefer the term “unbounded data” – the capability to pull together data and information from a range of disparate sources and to make sense of it in a way that an end user needs.
Therefore, when looking for a solution to “big data”, I recommend that organisations look for the following characteristics:
- Can this solution deal with different data types, including text, image, video and sound?
- Can this solution deal with disparate data sources, both within and outside of my organisation’s environment?
- Will the solution create a new, massive data warehouse that will only make my problems worse, or will it use metadata and pointers to minimise data replication and redundancy?
- How can and will the solution present findings back to me – and will this only be based on what has already happened, or can it predict with some degree of certainty what may happen in the future?
- How will the solution deal with backup and restore of data? Is it inherently fault tolerant, and can I easily apply more resource to the system as required?
With the massive growth of data volumes that is occurring, it is necessary to ensure that whatever solution is chosen can deal with such growth for a reasonable amount of time, at least five years. Therefore, flexibility is key, and a pure formal data focus based around rows and columns of data will not provide this. With a bit of luck, “big data” will be dead before it harms anyone – long live “unbounded data”!