Enterprise Data Warehouses (EDWs) are struggling to support rising data volumes, causing performance problems and, in the worst-case scenario, making it impossible to meet service level agreements. Sound familiar? Many companies can’t simply throw additional infrastructure at the problem because the costs involved are too high. This is especially true in situations where data volumes are growing so fast that the added capacity soon reaches its limits.

According to an October 2013 report by Cloudera and Syncsort, storing one terabyte of data on a mainframe costs anywhere from $20,000 to $100,000. Storing that same data in a data warehouse costs roughly $15,000 to $80,000. However, it costs only $250 to $2,500 to store it in a Hadoop cluster. That’s anywhere from six to 400 times cheaper than the alternatives.
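The "six to 400 times" range follows directly from the quoted figures. As a quick check, here is a minimal sketch that compares the cheapest alternative against the most expensive Hadoop figure, and the most expensive alternative against the cheapest Hadoop figure. The dictionary and function names are illustrative only; the dollar amounts are the ones cited above.

```python
# Per-terabyte storage cost ranges in USD (low, high), as quoted in the text.
COST_PER_TB = {
    "mainframe": (20_000, 100_000),
    "data_warehouse": (15_000, 80_000),
    "hadoop": (250, 2_500),
}


def savings_range(costs):
    """Return the (smallest, largest) factor by which Hadoop undercuts the alternatives."""
    hadoop_low, hadoop_high = costs["hadoop"]
    others = [bounds for name, bounds in costs.items() if name != "hadoop"]
    # Least favourable case: cheapest alternative vs. most expensive Hadoop.
    smallest = min(low for low, _ in others) / hadoop_high
    # Most favourable case: most expensive alternative vs. cheapest Hadoop.
    largest = max(high for _, high in others) / hadoop_low
    return smallest, largest


print(savings_range(COST_PER_TB))  # (6.0, 400.0)
```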

But it’s not just about the associated costs. It’s also about getting a complete view of the data to be able to extract insights. Oftentimes companies don’t store certain data sets because their perceived value doesn’t justify the traditionally associated storage costs. Hadoop, on the other hand, allows companies to store and process the data at a much lower cost, enabling them to derive insights based on all of their data.

So it’s no surprise that more and more companies are opting for hybrid architectures that use a big data store like Hadoop to complement the existing EDW. The hybrid approach relieves pressure on existing infrastructure, lowers EDW costs by reducing the need for expensive DW infrastructure, saves management costs and lowers the barrier for data to become eligible to be stored. These benefits come from offloading less frequently used or less valued data to Hadoop and from improving performance by moving data transformation workloads there.
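To make "less frequently used" concrete, here is one way an offload shortlist might be drawn up from usage statistics. This is only a sketch under stated assumptions: the `TableStats` record, the 90-day window, the query threshold and the sample tables are all invented for illustration and do not come from any particular EDW or integration tool.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical usage record for one EDW table."""
    name: str
    size_tb: float
    queries_last_90_days: int


def offload_candidates(tables, max_queries=10):
    """Return rarely queried tables, largest first, as candidates for Hadoop."""
    cold = [t for t in tables if t.queries_last_90_days <= max_queries]
    return sorted(cold, key=lambda t: t.size_tb, reverse=True)


# Made-up sample data, purely for illustration.
tables = [
    TableStats("orders_current", 2.0, 5400),
    TableStats("orders_2009", 6.5, 3),
    TableStats("web_logs_raw", 12.0, 0),
]

for table in offload_candidates(tables):
    print(table.name, table.size_tb)
```

In practice the threshold would depend on how the business values each data set, not on query counts alone.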

The EDW ingests data via the Extract/Transform/Load (ETL) process from internal operational systems like CRMs, ERPs and other legacy systems. Users trust EDWs as reliable production environments because data quality and data governance are always enforced.

Offloading the data from the EDW into a Hadoop cluster requires modern data integration tools that connect natively to Hadoop and its distributions and that enable visual ‘orchestration’ of the whole process. This is similar to a classic ETL environment in which no coding or knowledge of legacy scripts is required. Modern integration tools make interacting with highly complex technologies like Hadoop easier than many people expect.

It’s extremely important that data integration tools operate bi-directionally, so that data can be transferred between the EDW and the Hadoop store as needed. This is essential because Hadoop can store unstructured data types like social media, log data, machine data and sensor data in volumes that can’t be stored in the EDW. Very often analytical databases are part of these new hybrid environments. These query-oriented databases are designed to rapidly process high data volumes for analytical applications such as Business Intelligence.

5 Signs You May Need To Optimise Your EDW

  1. Do you have untapped high-volume data sources that you would like to incorporate into analytics?
  2. Have you seen a rapid proliferation of the number and variety of data sources needed to support business processes?
  3. Are your BI power users demanding direct access to multiple operational systems and creating security concerns as a result?
  4. Do you want to accelerate the process of creating new data sets for predictive analytics?
  5. Do you want to extend cost savings from optimising the data warehouse with Hadoop?