You know a truly geeky topic has broken through when The New York Times decides to write about it. Big Data is the latest topic to start to emerge into the larger consciousness, and it’s very geeky indeed—not gadget-level geeky, like phones or video game consoles, but enterprise-level geeky.
Big Data, for folks who haven’t already stumbled across the phrase, refers to the fact that many enterprises are now collecting so much data (about their customers, about their own business, about what-have-you) that their data sets have outgrown the tools normally used to handle them. We’re talking single datasets in the terabytes at the very least, ranging upwards into petabytes.
The focus of the article in the Times is on the uses of Big Data, which generally revolve around analyzing these large data sets to produce new insights. Facebook might use the mountains of social and behavioral data about its users to figure out how to target ads, or Amazon might use its customer data to make recommendations for new purchases. In a sense, it’s just data mining, but when both the scale of data to analyze and the available processing power to apply to the problem are enormous, the potential results are something altogether new.
Now Big Data is pretty cutting edge stuff, and I focus on practical, everyday problems faced by nearly all IT professionals. Still, when I read a story like this, what comes to mind is the storage problem behind Big Data. If you’ve got a petabyte of data, you’ve got to store it somewhere.
And if you actually want to use that data, possibly to serve real-time requests, your storage problem is a pretty challenging one. Still, it’s not all that different from the more prosaic problems of any admin who owns a lot of storage, which is almost any admin.
Storing data can be relatively simple: Buy arrays, carve up the storage into disks that can be accessed by various applications. Storage happens. Managing that storage, on the other hand, is not so easy. Whether you’ve got a true Big Data problem or your own version, writ smaller, you have the same problems.
- You probably need to put your data into tiers, with the frequently accessed data in high-performance storage and other data in slower, cheaper storage. Some Big Data proponents would say that all data needs to be highly accessible, but that can be cost-prohibitive. Identifying any places where data can be tiered becomes crucial.
- Storage performance optimization is critical to any project relying on Big Data. If you can’t get data in and out quickly, answers can’t be derived in anything like real time.
- You must be able to identify data contention issues. When multiple sources try to access data (even different data) that resides on the same physical disk, performance can suffer, and the cause can be hard to find.
- Finally, if you’ve got big data, it’s reasonable to expect that you’re going to get even bigger data. Planning for upcoming capacity needs is critical to long-term success.
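To make the tiering and capacity-planning points a little more concrete, here’s a rough Python sketch. Everything in it is made up for illustration: the `Dataset` fields, the tier names, the 1,000-reads-per-day cutoff, and the assumption that data grows by a fixed amount each month. Real tiering policies and growth curves are messier, so treat this as a back-of-the-envelope starting point rather than a recipe.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """A hypothetical record describing one chunk of stored data."""
    name: str
    size_tb: float
    reads_per_day: int


def assign_tier(ds: Dataset, hot_threshold: int = 1000) -> str:
    """Put frequently read data on fast storage, everything else on cheap storage.

    The threshold is an assumed cutoff; tune it to your own access patterns.
    """
    return "fast" if ds.reads_per_day >= hot_threshold else "cheap"


def months_until_full(used_tb: float, capacity_tb: float,
                      growth_tb_per_month: float) -> float:
    """Estimate how long the remaining capacity lasts, assuming linear growth."""
    if growth_tb_per_month <= 0:
        return float("inf")  # no growth means the array never fills up
    remaining_tb = max(capacity_tb - used_tb, 0.0)
    return remaining_tb / growth_tb_per_month


if __name__ == "__main__":
    datasets = [
        Dataset("clickstream", 120.0, 50_000),
        Dataset("2009_archive", 400.0, 3),
    ]
    for ds in datasets:
        print(ds.name, "->", assign_tier(ds))
    print("Months of headroom:", months_until_full(600.0, 1000.0, 50.0))
```

Even a crude model like this forces you to write down how often each data set is actually read and how fast you’re growing, and those are the numbers most storage decisions hinge on.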
While Big Data is currently the province of very large companies, this technology—like most technologies—will trickle down and become mainstream soon enough. Data is plentiful and ever more accessible, so before long, more IT professionals will be faced with the challenges of Big Data.
And to paraphrase Biggie Smalls, “Mo’ data, Mo’ problems.” But they’re fundamentally the same problems, which makes it a bit easier.