There are two main classes of big data that I have observed in my company’s customer base: “needle in the haystack” style data mining and mass-scale NoSQL-style “big” database applications. In this blog, I want to talk about the importance of choosing the right infrastructure services for your needle-in-the-haystack big data workloads.
The needle in the haystack approach to big data involves searching for relationships and patterns within a static or steadily growing mountain of information, hoping to find insights that will help you make better business decisions. These workloads can be highly variable with constant changes in scope and size, especially when you’re just starting out.
These workloads normally require substantial back-end processing power to analyse the high volume of data. To crunch this type of data effectively and find meaningful needles in your haystack, you need an infrastructure that can accommodate:
- Dynamically changing, periodic usage – Most big data jobs are processed in batches and require flexible infrastructure that can handle unpredictable, variable workloads.
- Large computational needs – “Big” data requires serious processing power to get through your jobs in a reasonable amount of time and provide effective analysis.
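To see in miniature why these jobs crave elastic, parallel compute, here is a hedged sketch of a needle-in-the-haystack filter (the names `scan_chunk` and `find_needles` are illustrative, not from any library). It splits a dataset into independent chunks and scans them concurrently; a real big data job would fan those chunks out across many nodes rather than local threads:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(chunk, needle):
    """Scan one chunk of records for the pattern we care about."""
    return [record for record in chunk if needle in record]

def find_needles(records, needle, workers=4, chunk_size=1000):
    """Split the haystack into chunks and scan them concurrently.

    In a real big data job each chunk would be a file block processed
    on a separate node; local worker threads stand in for that here.
    """
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    hits = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is scanned independently, so adding workers
        # (or machines) shortens the batch without changing the logic.
        for matched in pool.map(scan_chunk, chunks, [needle] * len(chunks)):
            hits.extend(matched)
    return hits
```

Because the chunks are independent, the same job can run on four workers today and four hundred tomorrow – which is exactly the elasticity these workloads demand from the underlying infrastructure.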
So what kind of infrastructure options can support these requirements? Multi-tenant virtual cloud platforms offer a great economic model and can handle the variable workloads, but their performance has limits.
Big data mining technologies such as Hadoop may perform acceptably in virtual environments when you’re just starting out, but they tend to struggle at scale because of their heavy storage I/O, network and computational demands. Those demands become extremely difficult to manage as your use cases evolve and grow.
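To make the I/O point concrete, here is a minimal word count in the Hadoop Streaming style, sketched as plain Python functions so the shuffle step can be simulated with a local sort. In a real job, mappers and reducers read stdin and write stdout on separate cluster nodes, and it is the shuffle of intermediate key/value pairs between them that generates much of the storage and network load described above:

```python
def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each word.

    Hadoop Streaming guarantees the reducer sees its input sorted by
    key; locally we reproduce that with an ordinary sort (the shuffle).
    """
    current, count = None, 0
    for pair in sorted_pairs:
        word, n = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"
```

Every intermediate pair has to be written, sorted and moved between nodes before the reducers can even start – which is why these jobs lean so hard on disk and network throughput as the data grows.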
The virtual, shared and oversubscribed aspects of multi-tenant clouds can lead to problems with noisy neighbours. Big data jobs are among the noisiest, and ultimately everyone in the shared virtual environment suffers, your own jobs included. An alternative is to build out dedicated infrastructure to alleviate these problems.
This leaves you with two bad options: either put up with the subpar performance of virtual pay-as-you-go cloud platforms, or start building your own expensive infrastructure. How do you get both the flexibility you need and the high level of performance required to process big data jobs efficiently?
Bare-metal cloud can provide the dedicated storage and compute that you need, along with flexibility for unpredictable workloads. In a bare-metal cloud platform, all compute and direct-attached storage are completely dedicated to your workloads. There are no neighbours, let alone noisy ones, to adversely impact your jobs. Best of all, you get and pay for exactly what your workload needs, and can spin the whole thing down when the job is done.
One caveat – even with dedicated servers and storage, the network layer is still shared among multiple tenants, which can be a limiting factor for large-scale Hadoop jobs where wire-speed performance is a must. Even though bare metal is one of the best price-for-performance cloud options, your workload may not be able to tolerate such limitations as your big data needs grow. Managed hosting or private cloud to the rescue.
Managed hosting or private cloud is a better option in some cases, as the infrastructure is dedicated to you on a private network and can be customized to accommodate your specific needs. These options deliver wire-speed network performance along with dedicated compute, storage and reasonable agility. Of course, this won’t be the most economical option, but if your workload requirements demand it, the trade-off is well worth it.
Whether you begin your big data endeavour with virtual cloud or bare-metal cloud, it’s important to recognise that your infrastructure needs will change over time. When starting out, a virtual cloud or a bare-metal cloud can suffice, with bare metal providing better performance and scale capabilities. But as your big data needs expand, a fully dedicated, managed private cloud may fit better, without the limitations of a shared network.
Given that change is the only constant in big data, choosing a provider that offers more options and allows you to adjust as your needs change is key.