“Information is power” had become something of a cliché by the end of the last century, but it takes on new significance with the concept of Big Data. Our ability to collect, store, access and analyse data continues to grow thanks to Moore’s Law and other advances in information technology. Global business is prepared to spend millions of dollars mining the mountains of data it has gathered from customers and suppliers.
When does data become Big Data? Big Data can be any type of data – structured or unstructured – such as text, sensor data, audio, video, click streams, log files and more, and analysing these data types in combination can lead to whole new insights – hence the expression “data mining”. Business organisations can be a little cagey about their leading edge techniques, so some of the better examples come from the world of scientific research.
The Sloan Digital Sky Survey began collecting data at the beginning of this century, and in a few weeks it had gathered more data than the total collected in the history of astronomy. If you simply consider visual data – photographs of areas of the night sky – then each photograph provides a lot of data about one small patch of the heavens. If you have one million such photographs covering the whole sky, then you have one million times that amount of data.
But if you now merge all those patches into one image of the total night sky, you are able to answer a whole new set of questions, such as “which are the 10 brightest stars in the sky?” or “what is the average distribution of stars across the heavens?” or “how evenly are stars distributed?” – questions that would be harder to answer from a million separate overlapping pictures.
That example would suggest that it is always best to lump separate datasets into one big file in order to allow more and deeper analysis, but it is not that simple. Older readers may remember the day when, working with a large document in a word processor, it was worth breaking the document up into chapters so that the word processor could work more quickly on the smaller files.
With today’s processors, that problem is long past but, when you move into the realms of Big Data, it returns. As the number of datasets expands, the number of ways of combining and comparing them explodes: the pairs alone grow with the square of the number of datasets, and the possible orderings grow factorially, leaving even Moore’s Law far behind.
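To give a feel for that growth, the short sketch below counts, for a given number of datasets, the pairwise comparisons, the possible groupings and the possible orderings. The function name and the sample sizes are illustrative assumptions, not figures from any particular system.

```python
import math


def comparison_counts(n):
    """Return (pairs, subsets, orderings) for n datasets."""
    pairs = math.comb(n, 2)          # ways to pick two datasets to compare
    subsets = 2 ** n - 1             # non-empty groups of datasets to merge
    orderings = math.factorial(n)    # distinct orderings (permutations)
    return pairs, subsets, orderings


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        pairs, subsets, orderings = comparison_counts(n)
        print(f"{n:>3} datasets: {pairs:>4} pairs, "
              f"{subsets:.3e} subsets, {orderings:.3e} orderings")
```

Even at 20 datasets the number of orderings is already around 2.4 × 10^18, while the number of pairs is still only 190.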
Apparently Walmart handles more than one million customer transactions every hour, feeding databases estimated to contain more than 2.5 petabytes of data. This is obviously necessary in order to keep tabs on stock and revenue for tax purposes, but these are relatively simple uses for that data.
More complex analysis might look at patterns and trends of purchasing – seasonal variations, the spread of fashionable buying trends, socio-economic buying patterns by region and so on. If a retail chain has many customers using loyalty cards – customers who may have provided much more personal detail, such as their age and education – then it becomes possible to make some very subtle analyses of buying patterns and trends.
This is obviously very interesting for academic research, but is it really valuable for a business? The answer is that it could be, as we enter into a new era of decision-making described by the term “Big Data”.
The attraction is this: going back to that saying “information is power”, the retail chain is sitting on a vast amount of data that can serve simply as a record for its accounting and taxation needs, or it can serve as a gold mine providing precious information that will give the company the edge over its competitors and leave smaller companies with fewer resources far behind.
Provided that the company has the skills and the IT resources to mine that mountain of data for useful facts about consumer choices and changing trends, and provided that it has the decision-making wisdom to understand and act upon the resulting analyses, it is in a very strong position to optimise its ongoing business operation.
So the usefulness of Big Data depends upon three things. First, unless the decision-makers have the analytic skills to make good decisions based on the findings, the whole exercise is a waste of time and money.
Similarly, unless the researchers know what sort of research is going to be useful and how to analyse the data in a valid way, the whole operation might simply flood the decision-makers with a mass of irrelevant or, at worst, misleading facts. Even carefully conducted scientific research often reaches wrong conclusions because of small errors in the initial planning of the analysis.
The third requirement is the key one addressed in this article: does the organisation have the IT resources to handle Big Data? Assuming that we are gathering and storing the necessary data, the question is whether the organisation is able to access and analyse it fast enough to be useful.
The four main detectors in CERN’s Large Hadron Collider produced thirteen petabytes of data in their first year, and CERN famously resorted to a co-operative hive computing solution spread over thousands of computers worldwide in order to process the results. “Fast enough” in that case meant as fast as it was possible to handle what was an unprecedented mass of data. CERN was not exactly killing time, but nor was it in a competitive situation comparable to most business enterprises.
Healthcare offers more “life or death” pressures, and today’s high definition scans provide massive lumps of data that may need to be distributed across the world to a number of different specialists for analysis. Such data is normally transmitted overnight in order not to hog bandwidth during the day, and this is still generally acceptable in what is a highly critical sector.
At the other extreme – and this may include high frequency trading or military applications in organisations reluctant to share their secrets – there are those who would restrict the use of the term ‘Big Data’ to systems using directly-attached solid state storage for fastest possible, lowest latency access, plus massively parallel processing, to deliver near real-time analyses.
So, where on that scale does your own organisation lie? How big is your data mine? How powerful are your analytics engines? And how fast can the data be continually accessed and used?
It is this last question that is perhaps the easiest to overlook. If the organisation’s data is held on spinning disks distributed across a SAN, each access will incur latencies that may seem insignificant in a single transaction but which mount up as the number of access queries increases during an extended analysis.
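The arithmetic is simple to sketch. The figures below are assumed round numbers chosen for illustration (about 5 ms per random read from a spinning disk and 0.1 ms from solid state storage), not measurements of any real SAN, and the function name is hypothetical.

```python
def total_wait_seconds(accesses, latency_ms):
    """Time spent waiting if `accesses` reads are issued one after another."""
    return accesses * latency_ms / 1000.0


if __name__ == "__main__":
    DISK_MS = 5.0    # assumed per-read latency, spinning disk
    FLASH_MS = 0.1   # assumed per-read latency, solid state storage

    for accesses in (1, 10_000, 10_000_000):
        disk = total_wait_seconds(accesses, DISK_MS)
        flash = total_wait_seconds(accesses, FLASH_MS)
        print(f"{accesses:>10,} reads: disk {disk:>10,.1f} s, flash {flash:>8,.1f} s")
```

On these assumptions, ten million sequential reads spend about 14 hours waiting on disk against under 17 minutes on solid state storage. Parallel access shortens both figures, but the ratio between them is the point.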
The move from simply storing a lot of historical data for organisational purposes to mining that resource intelligently as “Big Data” could mean a massive change in the way data moves around the network, and this could well lead to unexpected consequences.
A vital prerequisite to any planned Big Data operation must be to test the system to see whether it will be able to deliver the data quickly and reliably enough under a range of real-world operating conditions. Unless the operation is held in quarantine, it will also be vital to test what impact it will have on the rest of the business – will it crash other processes or simply slow down the network and frustrate other users?
There are inevitably many unknowns when stepping into a new venture like this, so a lot of testing will be needed, along with the flexibility to adjust all the operating parameters and, if a test fails, to reset the system and start again. This requires sophisticated test methodologies and automated testing in order to cover so many possible variations.
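As a rough illustration of what automated coverage of many variations involves, the sketch below runs one test per combination of operating parameters and resets the system before each run. Everything in it is hypothetical: the parameter names, the `reset_system` and `run_test` stand-ins and the pass threshold are placeholders for whatever a real test tool would provide.

```python
import itertools

# Hypothetical operating parameters to vary; real values depend on the system.
PARAMETERS = {
    "concurrent_queries": [10, 100, 1000],
    "block_size_kb": [4, 64],
    "background_load_pct": [0, 50],
}


def reset_system():
    """Stand-in for restoring the system under test to a known state."""
    return {"state": "clean"}


def run_test(system, config):
    """Stand-in test: pretend latency rises with load (purely illustrative)."""
    latency_ms = (config["concurrent_queries"]
                  * (100 + config["background_load_pct"]) / 10000)
    return {"config": config, "latency_ms": latency_ms,
            "passed": latency_ms <= 10.0}


def sweep(parameters):
    """Run one test for every combination of parameter values."""
    names = list(parameters)
    results = []
    for values in itertools.product(*(parameters[n] for n in names)):
        system = reset_system()               # start each case from scratch
        config = dict(zip(names, values))
        results.append(run_test(system, config))
    return results


if __name__ == "__main__":
    results = sweep(PARAMETERS)
    failed = [r for r in results if not r["passed"]]
    print(f"{len(results)} cases run, {len(failed)} failed")
```

Three parameters with only two or three values each already produce twelve cases, which is why running such sweeps by hand quickly becomes impractical.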
Fortunately, today’s best network test technology is able to provide this via a simple graphical user interface that allows sophisticated tests to be set up rapidly without specialist scripting skills, held in memory and run automatically for consistent testing across any number of test cases.
Modelling truly realistic traffic patterns requires more than just injecting random strings of data – the best test tools will recreate the actual application-level state, modelling cookies, session IDs, NAT translations, ALG-driven modifications, authentication challenges, lengths and checksums.
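Checksums show why random strings fall short: a receiver will simply discard a packet whose checksum does not match its contents. As a minimal sketch, the function below computes the 16-bit Internet checksum described in RFC 1071, the algorithm used in IP, TCP and UDP headers. The sample bytes are the worked example from that RFC; the function name is illustrative and is not taken from any test tool.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # add each 16-bit big-endian word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    checksum = internet_checksum(payload)
    print(f"checksum: 0x{checksum:04x}")

    # A receiver sums the data plus checksum; a valid packet yields zero.
    assert internet_checksum(payload + checksum.to_bytes(2, "big")) == 0

    # Changing one byte without updating the checksum is detected.
    tampered = b"\xff" + payload[1:]
    assert internet_checksum(tampered + checksum.to_bytes(2, "big")) != 0
```

A test tool that alters a payload, for example to insert a session ID, therefore has to recompute the checksum afterwards, or the target will reject the traffic before the application ever sees it.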
Also, when networks become congested under normal operation, special algorithms cut in for TCP (Transmission Control Protocol) flow and congestion control that change the behaviours of both the test tool and the target. The best test tools are designed to behave exactly as a real client would in these circumstances.
Expecting an enterprise’s massive database to migrate into an ongoing Big Data resource is like asking a sedentary office worker to get up and take part in a marathon race – it is not impossible, but it depends on testing the worker’s physical state and carefully planning a training regime to adapt for the new role. Without the use of the very best test solutions and methods, the migration to Big Data would simply be a step in the dark.