The latest entry into the BDDB (big data database) market is Hadapt, which has just announced the public beta (it has been in private beta since June) of its eponymous product.
What is interesting, and different, about Hadapt is that it has a hybrid database with two storage engines. One of these is HDFS (Hadoop distributed file system) and the other is a relational database based on PostgreSQL. So, you store unstructured data in HDFS and structured data in the relational database.
While unusual, this is by no means a unique approach. For example, DB2 has two storage engines: one for relational data and one for XML. However, one other point of note is that Hadapt has been designed specifically for the low-cost clusters that are common in BDDB deployments and, especially, for deployments of these in the cloud.
One of the problems with public environments of this sort is that you can get what might be called ‘node slowdowns’, where a particular node suddenly and inexplicably starts running much slower than normal (perhaps because Amazon, or whoever, is performing maintenance).
To mitigate this issue, Hadapt triplicates not only the unstructured data (as is standard with HDFS) but also the structured data. If a node starts to slow, processing can then be diverted to one of the other nodes holding that data in order to ensure consistent performance.
Of course, there is a downside in holding extra copies of the data but, since structured data is only going to be a small part of the whole, this should not be onerous and, of course, you get failover and load balancing for free.
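To make the replica-diversion idea concrete, here is a minimal sketch in Python. It is not Hadapt's implementation, which the company has not described at this level of detail. The `Node` class, the `recent_latency_ms` metric and the 500 ms threshold are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    recent_latency_ms: float  # hypothetical health metric, not a Hadapt API


def pick_replica(replicas, slow_threshold_ms=500.0):
    """Choose which replica should serve a piece of work.

    Replicas under the (assumed) latency threshold are preferred; if every
    replica is slow, fall back to the least slow one rather than failing.
    """
    if not replicas:
        raise ValueError("no replicas hold this data")
    healthy = [n for n in replicas if n.recent_latency_ms < slow_threshold_ms]
    candidates = healthy or replicas
    return min(candidates, key=lambda n: n.recent_latency_ms)


# Three copies of the same data; node-a is suffering a 'node slowdown'.
replicas = [
    Node("node-a", 2400.0),
    Node("node-b", 120.0),
    Node("node-c", 95.0),
]
chosen = pick_replica(replicas)
print(chosen.name)
```

With three copies available, the slow node is simply bypassed, which is also why the same mechanism gives you failover and load balancing.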
The one comment I would make is that sometimes you are going to want to query just the structured data, in which case you want great traditional analytic performance, and a fairly standard PostgreSQL row-based database probably isn’t going to hack that.
However, we understand from Hadapt that it would be relatively trivial to replace that data store with, say, a column-based one so the potential is there to offer a really powerful combination of structured and unstructured query support (which Hadapt calls multi-structured) within the same system.
In so far as querying the data is concerned, Hadapt is SQL-based. I understand that it has significantly more capabilities than Hive (and, in internal tests, is an order of magnitude faster), though it is not yet ANSI compliant (it will be). Even so, being SQL-based means that you should be able to use your existing business intelligence tools alongside Hadapt.
After a query is received the database software does two things: it generates MapReduce jobs to run against HDFS, and it uses what the company calls ‘Split Query Execution’ (for which the company has applied for a patent) to direct the relevant SQL at the structured data in the relational engine. The result is that you can have a single query running against both storage engines.
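The general shape of splitting one query across two engines can be sketched as follows. This is emphatically not Hadapt's patent-pending method, whose details are not public. It is a toy illustration in which Python's built-in sqlite3 stands in for the relational engine, a pair of plain Python functions stands in for a MapReduce job over HDFS, and the table, column and function names are all invented for the example.

```python
import sqlite3
from collections import Counter


def query_structured(conn, min_spend):
    """Structured half: plain SQL against the relational stand-in."""
    rows = conn.execute(
        "SELECT customer_id, name FROM customers WHERE spend >= ?",
        (min_spend,),
    ).fetchall()
    return dict(rows)  # {customer_id: name}


def map_phase(docs, keyword):
    """Unstructured half, map step: emit (customer_id, keyword count) per document."""
    for customer_id, text in docs:
        yield customer_id, text.lower().split().count(keyword)


def reduce_phase(pairs):
    """Unstructured half, reduce step: sum the counts per customer."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def split_query(conn, docs, min_spend, keyword):
    """Run both halves, then join them on customer_id into one result."""
    structured = query_structured(conn, min_spend)
    mentions = reduce_phase(map_phase(docs, keyword))
    return sorted(
        (name, mentions[cid])
        for cid, name in structured.items()
        if mentions[cid] > 0
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", 900.0), (2, "Bob", 50.0), (3, "Cy", 1200.0)],
)
docs = [
    (1, "Refund please, refund now"),
    (2, "refund requested"),
    (3, "great service"),
    (1, "another refund"),
]
result = split_query(conn, docs, min_spend=500.0, keyword="refund")
print(result)  # [('Ann', 3)]
```

The point of the sketch is only that the caller asks one question ("which high-spending customers mention refunds, and how often?") and never sees that two different engines answered it.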
As far as Hadoop itself goes, Hadapt is distribution-agnostic. If you are concerned about a NameNode single point of failure, for example, then you could choose to use Hadapt in conjunction with MapR, though most of the company’s beta sites are using Cloudera.
It is far too early to say how successful Hadapt will be. Clearly, the ability to address multi-structured data within a single query can potentially open up new avenues of discovery but the product is not yet fully functional or mature.
However, if it does prove to be successful then it raises one big question mark: to date, the assumption amongst the data warehousing community (vendors and analysts alike) has been that you will have a traditional data warehouse on the one hand and a BDDB alongside it, linked by some sort of data integration technology. Hadapt calls that whole approach into question: why have two products when one can do the job as well or better?
If Hadapt proves successful then we may see other vendors moving in the same direction. For example, Aster Data (Teradata) already supports MapReduce within its warehouse and adding a BDDB storage engine certainly wouldn’t be a stretch. Similarly, as I have already mentioned, IBM already has two storage engines under DB2: why not a third? It could be that Hadapt represents a shift in the way we think about the combination of big data and warehousing in much the same way that Netezza made us think again about data warehousing itself just a few years ago.