The term big data has come to mean big headaches for IT organisations and big problems for consumers. Privacy is a growing concern as more and more data is not only collected but voluntarily shared by consumers in exchange for free access to applications and functionality.
Those wondering how much sites such as Facebook might know about them have to jump through hoops to find out and are likely to be surprised by how many personal details websites actually store.
The TV documentary Erasing David, screened on More 4 in 2010, detailed an attempt by filmmaker David Bond to do just that: find out how private his identity really is. After deliberately disappearing for a month, he hired detectives to track him down.
Before his disappearing act, Bond spent weeks trying to find out just how much information various websites held on him. Big data took on a whole new meaning as he sat at a desk, poring over more than 1,000 printed pages from Facebook alone.
The UK’s Midata initiative
The UK government is proposing to make part of that discovery process easier on the consumer and their wallets with its Midata initiative, whereby consumers would have access to some of their data held by private organisations.
The government is promising protocols to handle privacy or consumer protection issues — but also stresses that this is a private-sector initiative and it will not be hamstrung by rules and regulations.
Given the amount of data stored by private organisations — consider the amount of storage required to maintain data on spending patterns in supermarkets — such an effort raises many questions, from how such vast amounts of data might be transferred to where they might end up.
Perhaps most interesting from a technology perspective is whether or not the cloud makes it possible to overcome the problems that have derailed these kinds of big data-sharing efforts in the past.
The cloud as a data warehouse
Whether it’s sharing of data by the private sector, or attempts at similar initiatives within the US government or vertical industries such as healthcare, data-sharing has issues. The main problems relate to the format of the data, both for immediate integration and for long-term accessibility.
Cloud computing is often touted as a solution to the storage and processing of big data thanks to an illusory perception of infinite capacity. But the reality is that if data is stored in a format that cannot easily be exploited by very different applications, it is little more than a digital dump of bits and bytes.
There is an incorrect assumption that the hardware and infrastructure agnosticism of cloud computing translates equally well to that of data and applications. This fallacy is one that is difficult to expose because of a failure to understand how data is serialised to and from applications. Big data is not well suited to transfer via the most standardised of methods today — RESTful APIs with JSON or XML-encoded data.
Indeed, exchange of big data requires far more care, because of its bulk and the need to ensure formatting in a data protocol easily interpreted by a wide variety of platforms and programming languages. Unfortunately, these two distinct requirements are at odds with one another. Formats most easily interpreted by the widest variety of platforms and languages result in data sets far larger than those encoded in more compact, space-saving formats.
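The tension between portability and size can be illustrated with a small sketch. It assumes a hypothetical purchase record (the field names, values and the fixed binary layout are invented for illustration and are not drawn from Midata or any retailer) and encodes the same records twice: once as self-describing JSON, and once as a fixed-width binary layout that is compact but that every consumer must know in advance.

```python
import json
import struct

# Hypothetical record layout (an assumption for illustration only):
# customer_id (uint32), store_id (uint16), amount_pence (uint32), unix_time (uint64)
FIELDS = ("customer_id", "store_id", "amount_pence", "unix_time")
RECORD_FORMAT = "<IHIQ"  # little-endian, no padding: 4 + 2 + 4 + 8 = 18 bytes


def make_records(n):
    """Generate n synthetic purchase records as tuples."""
    return [
        (100000 + i, i % 500, 250 + (i * 37) % 10000, 1325376000 + i * 60)
        for i in range(n)
    ]


def encode_json(records):
    """Self-describing text: readable almost anywhere, but repeats every key."""
    return json.dumps([dict(zip(FIELDS, r)) for r in records]).encode("utf-8")


def encode_binary(records):
    """Fixed-width binary: compact, but the reader must know RECORD_FORMAT."""
    return b"".join(struct.pack(RECORD_FORMAT, *r) for r in records)


def decode_binary(blob):
    """Recover the tuples from the packed bytes."""
    return list(struct.iter_unpack(RECORD_FORMAT, blob))


records = make_records(1000)
json_blob = encode_json(records)
binary_blob = encode_binary(records)

print("JSON bytes:  ", len(json_blob))
print("binary bytes:", len(binary_blob))
print("ratio:        %.1fx" % (len(json_blob) / len(binary_blob)))
```

The JSON version can be read by any platform with a JSON parser and no prior agreement, while the binary version is several times smaller but meaningless without the layout. Real exchanges would also have to weigh compression and schema-based formats, which this sketch leaves out.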
Cloud cannot address this particular problem because it was not designed to do so. Cloud can certainly provide the ubiquity of access and immediate scalability of storage resources necessary to facilitate a successful big data-sharing project. But it cannot address the inadequacies in the transferral process that often plague big-data exchanges.
These obstacles must be addressed before we can even begin to look at access control and management of such a warehouse. Otherwise, we’ll run foul of existing privacy regulations around the globe governing who can access what and from where.
While the UK government insists it does not want to hamper the sharing of data by private-sector initiatives with regulations and laws, many such inhibitors already exist and have a serious impact on the ability to exploit cloud computing for such scenarios.
The cloud is well suited for many tasks, especially for parallelised analysis of big data. But the data must get into the cloud in the first place and be accessible to those systems performing the analysis. To date there is little sign of initiatives that have proven the ability to do just that in a timely, efficient and highly interoperable manner. This issue may be one of the challenges cloud computing simply cannot address. At least, not yet.