Most companies exploring the use of big data for business intelligence purposes do not simply want to analyse unstructured data; they also want to combine the results of that analysis with relevant structured data. They want their analytics to span all sorts of data, which we may refer to as hybrid data.
Unfortunately, NoSQL data stores are not really suitable for storing (and therefore analysing) structured data while conventional data warehouses are not very good at storing (and therefore analysing) unstructured data.
As a result, the architecture that is emerging from a data warehousing perspective is that you store unstructured data in something like Hadoop, do basic analysis work on that data and create summary information that can be passed to the formal data warehouse, where that information can be further analysed. You can either do this through direct integration between the different environments or by means of a federated query environment (such as Composite Software’s) that supports Hadoop.
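As a rough illustration of the first option (summarise in the unstructured store, then pass the summary to the warehouse), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the ticket texts, the keyword list, and the table names are hypothetical, plain Python functions stand in for a MapReduce job, and an in-memory SQLite database stands in for the data warehouse.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical raw documents, standing in for files held in something like Hadoop.
SUPPORT_TICKETS = [
    ("C001", "Router keeps dropping the connection. Refund requested."),
    ("C002", "Billing error on last invoice, refund please."),
    ("C001", "Connection dropped again after the firmware update."),
]

# Hypothetical terms of interest; a real job would use richer text analysis.
KEYWORDS = {"refund", "connection", "billing"}


def map_ticket(customer_id, text):
    """Map step: emit ((customer_id, keyword), 1) for each keyword occurrence."""
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in KEYWORDS:
            yield (customer_id, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per (customer_id, keyword) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def summarise(tickets):
    """Run the map and reduce steps over all tickets."""
    pairs = (pair for cid, text in tickets for pair in map_ticket(cid, text))
    return reduce_counts(pairs)


def load_and_query(summary):
    """Load the summary next to structured data and analyse both with SQL."""
    db = sqlite3.connect(":memory:")  # stand-in for the formal warehouse
    db.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, region TEXT)")
    db.execute(
        "CREATE TABLE ticket_summary "
        "(customer_id TEXT, keyword TEXT, mentions INTEGER)"
    )
    db.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [("C001", "North"), ("C002", "South")],
    )
    db.executemany(
        "INSERT INTO ticket_summary VALUES (?, ?, ?)",
        [(cid, kw, n) for (cid, kw), n in summary.items()],
    )
    rows = db.execute(
        "SELECT c.region, s.keyword, SUM(s.mentions) "
        "FROM ticket_summary s JOIN customers c ON c.id = s.customer_id "
        "GROUP BY c.region, s.keyword ORDER BY c.region, s.keyword"
    ).fetchall()
    db.close()
    return rows


if __name__ == "__main__":
    for row in load_and_query(summarise(SUPPORT_TICKETS)):
        print(row)
```

Note that the warehouse only ever sees the summary rows, not the original text, so any question that the summary did not anticipate means going back to the unstructured store.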
For large organisations this approach makes sense, but smaller companies with smaller budgets may have an issue with such a potentially expensive solution. One alternative is to store all the information in a warehouse such as that provided by Aster Data (Teradata) or Greenplum (EMC), which support native MapReduce capabilities.
However, there are potential scalability issues if you try to do this. The real problem is that conventional BI tools do not support the analysis of both structured and unstructured data within the same query, which is what you would really like to do. Instead, you have to use MapReduce on the one hand and some SQL-based tool on the other.
However, that does not mean that suitable hybrid tools do not exist. In particular, Endeca Latitude and Connexica (previously ArdentiaSearch) CXAIR both support query capabilities that span structured and unstructured data.
The two products have different implementations but the same basic philosophy, which is to extract structure from unstructured data and then combine that with directly structured data by means of indexes (search-based indexes, not database-style indexes). Both products are very easy to use (both companies place special emphasis on ease of use for end users) and both focus on allowing users to explore the data rather than just report on it.
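To make the idea of a search-based index spanning both kinds of data more concrete, here is a small Python sketch. It is not how either Latitude or CXAIR is implemented; the `HybridIndex` class, its method names, and the sample orders are all hypothetical. It simply assumes that structured fields and tokens extracted from free text are posted into one inverted index, so that a single query can combine the two.

```python
import re
from collections import defaultdict


class HybridIndex:
    """Toy inverted index: (field, value) -> set of record ids (hypothetical design)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_structured(self, record_id, fields):
        """Index each structured field/value pair as its own term."""
        for name, value in fields.items():
            self._postings[(name, str(value).lower())].add(record_id)

    def add_text(self, record_id, text):
        """Extract crude structure (lower-cased word tokens) from free text."""
        for token in set(re.findall(r"[a-z]+", text.lower())):
            self._postings[("text", token)].add(record_id)

    def search(self, **criteria):
        """Return the sorted ids of records that match every criterion."""
        result = None
        for name, value in criteria.items():
            ids = self._postings.get((name, str(value).lower()), set())
            result = set(ids) if result is None else result & ids
        return sorted(result or [])


# Hypothetical sample data: structured order fields plus free-text notes.
index = HybridIndex()
index.add_structured("order-1", {"region": "North", "product": "router"})
index.add_text("order-1", "Customer complained that the router overheats.")
index.add_structured("order-2", {"region": "North", "product": "modem"})
index.add_text("order-2", "Delivery was late but the modem works.")
index.add_structured("order-3", {"region": "South", "product": "router"})
index.add_text("order-3", "Router overheats after an hour.")

# One query spanning a structured field and the unstructured notes.
northern_overheating = index.search(region="North", text="overheats")
```

Here `northern_overheating` contains only `order-1`, because that is the one record whose structured region is North and whose notes mention overheating. The query is a set intersection over postings rather than a join between two systems, which is the sense in which a search-style index differs from a database-style one.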
However, they are rather different when it comes to their approach to the market. Specifically, Latitude is aimed at companies that want to develop analytic applications to support exploration of hybrid data, while CXAIR (which stands for ConneXica Ad-hoc Interactive Reporting) is aimed more at the traditional BI market, although the product is also being OEM’d by a number of third parties that have embedded the tool in their own products (in place of, for example, Crystal Reports).
I expect to be writing more about Latitude and CXAIR in the future but, to go back to my initial point, it seems there is no one-size-fits-all solution to the problem of how to provide BI that spans hybrid data.
There is clearly a choice of warehousing architectures and, no doubt, the leading BI vendors will bolt on unstructured capabilities that will compete with the built-for-purpose technologies from Endeca and Connexica. Quite how all this plays out remains to be seen but, if you are interested in hybrid-structured BI right now, you should check out Latitude and CXAIR.