The big data revolution means there are nuggets of insight within customer data everywhere: CRM records, reviews, engineer logs, complaints, enquiries, surveys, social media and more. The ability to harvest and analyse this unstructured data, and to join the heterogeneous datasets to provide predictive, actionable insight, is a holy grail for marketers and customer services executives.

Yet firms are getting bogged down, spending too much time and too many resources cleansing and transforming data. The New York Times estimated that up to 80% of a data scientist’s time is spent “data wrangling”; CrowdFlower similarly puts “data preparation” at 80%. Assumptions and errors are an inevitable part of any process where human judgement and skill are required. What’s more, 76% of data scientists view data preparation as the least enjoyable part of their work.

So is this inevitable? Whilst it might seem so, technologies are now appearing which can automate the vast majority of this work and dramatically speed up insight from heterogeneous data.

Revolutionary algorithms can now work with unstructured, disparate and dirty datasets and extract the information required for meaningful insight. Rather than a ‘bottom-up’ approach, where a data scientist must extract features, build dictionaries and define variables, it is possible to analyse all the data in a ‘top-down’ way. In practice this means exhaustive naïve transformations are performed on the datasets in real time to extract meaningful signals, whatever the data type or quality. Where text is involved, Natural Language Processing (“NLP”) is used to semantically parse the data and substitute synonyms, typos and context-sensitive acronyms.
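To make the text-normalisation step concrete, here is a minimal, stdlib-only Python sketch. The vocabulary, the synonym table and the fuzzy-matching rule are illustrative assumptions for this article, not the vendor's actual pipeline, which would derive such mappings from the corpus itself:

```python
import difflib

# Illustrative canonical vocabulary (assumed; a real pipeline would
# learn this from the data rather than hard-code it).
VOCAB = {"cancel", "billing", "refund", "delivery", "account"}

# Illustrative synonym/acronym substitutions.
SYNONYMS = {"acct": "account", "cancellation": "cancel"}

def normalise(token: str) -> str:
    """Map a raw token to a canonical form: expand known synonyms,
    then fuzzy-match probable typos against the vocabulary."""
    token = token.lower()
    if token in SYNONYMS:
        return SYNONYMS[token]
    if token in VOCAB:
        return token
    # Treat near-misses (e.g. 'refnud') as typos of a vocabulary word.
    match = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.8)
    return match[0] if match else token

print([normalise(t) for t in "Refnud for my acct".split()])
# → ['refund', 'for', 'my', 'account']
```

Real systems handle context-sensitive acronyms and far richer linguistics, but the principle is the same: raw tokens are resolved to canonical forms before any analysis.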

By using this technology (also known as AIR, or ‘Automated Information Retrieval’), signals in text, time-series and structured data can be retrieved either by clustering (i.e. unsupervised) or by classification (i.e. supervised), without a dictionary of terms being built a priori. The output, now structured, is valuable in itself and can be further analysed alongside any other structured data in a machine learning algorithm for predictive analytics, e.g. to predict customer behaviour from the signals. Dictionaries of specific terms can sharpen the resolution, and these are automatically suggested by the technology (and validated by an end user) rather than having to be pre-defined.
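The ‘clustering without a pre-built dictionary’ idea can be sketched very simply. The greedy single-pass assignment, the whitespace tokenisation and the similarity threshold below are illustrative assumptions, far cruder than a production AIR system, but they show how groups can emerge from the data rather than from pre-defined terms:

```python
import math
from collections import Counter

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(texts, threshold=0.3):
    """Greedy one-pass clustering: each text joins the most similar
    existing cluster, or starts a new one. No dictionary up front."""
    clusters = []  # each: {"centroid": Counter, "members": [str]}
    for text in texts:
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for c in clusters:
            s = similarity(vec, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"centroid": vec, "members": [text]})
        else:
            best["centroid"] += vec
            best["members"].append(text)
    return clusters

comments = [
    "bill was too high this month",
    "my bill is too high",
    "delivery arrived late again",
    "late delivery twice in a row",
]
groups = cluster(comments)
print([c["members"] for c in groups])
```

Here the billing complaints and the delivery complaints fall into two separate clusters purely from word overlap; the cluster labels themselves (and any dictionary of terms) can then be proposed to the end user for validation, as described above.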

This immediately saves the 80%+ of time typically spent cleansing and transforming data. What’s more, the whole process is automated, so predictive analytics becomes truly a ‘product’ rather than a ‘project’. It can act as an early-warning system, generating opportunities (such as targeted customer segments predicted to have a high take-up of an offer) or flagging threats (such as customers at risk of churning away): real, relevant recommendations that marketers can implement straight away without further interpretation.

Firms no longer need an army of data scientists to gain predictive insight, and can focus on the more valuable and enjoyable parts of the role: defining the problems and analyses, and implementing the results. The technology can be implemented easily, and its results interpreted directly by the end user.


One example of this new breed of software comes from Warwick Analytics. Notably, the technology was applied and validated at Motorola, the home of Six Sigma, last year to support its quality processes.

Utilities, telcos and media companies are also using the technology to generate recommendations and prioritise the actions they need to take to improve customer satisfaction and NPS and reduce churn. One large transport provider recently used it to create insight and actionable solutions from unstructured, textual data drawn from its call centres, surveys, social media and review websites. Through the automated analysis it was able to generate predictors and actions for specific customers, mitigating the churn of profitable customers and prioritising the most effective outbound actions. It also identified changing customer requirements so it could optimise the adoption of up-sell and cross-sell opportunities, all in real time.

Other examples span many sectors:

  • Retailers and CPG firms are mining their vast data lakes of customer data for predictive insight to increase customer basket sizes, inform product improvements and optimise marketing spend.
  • A leader in the UK legal expenses insurance market saved upwards of £300,000 by improving one-call resolution and re-engineering its conversion process using speech analytics.
  • A global technology company reduced the time needed to develop and send a survey from days to less than 30 minutes, and lowered time to insight from up to three weeks to near real-time.
  • Financial institutions are reducing call handling time, elevating contact centre credibility, and increasing their Net Promoter Scores.

Practical Tips

  1. Complex analytics isn’t the answer to every prayer. If an issue can be solved with a spreadsheet, then it should be. There are techniques and algorithms which can be used reliably by team members with basic knowledge of, say, Minitab or Excel statistical functions, and whose outputs are easy to validate and act upon.
  2. The “so what” test: ensure you know beforehand what the output can and will look like, that it can be interpreted into action, and that there are resources available to carry that action out.
  3. If you can avoid cleansing data and making assumptions, then do so. Is dirty data really dirty? Where possible, outliers should be allowed to speak for themselves, even if they point to poor processes: sometimes dirtiness is itself an indicator.
  4. If there is a disconnect when it comes to buy-in at executive level, the project is likely to fail from the start, and there will be no appetite to roll it out even if it succeeds. So pick a ‘low-hanging fruit’ problem initially (modest in scope but with high impact) and get senior buy-in.
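On tip 3, the ‘let outliers speak’ idea can be sketched in a few lines. The z-score threshold and the sample data below are illustrative assumptions; the point is to flag outliers for inspection rather than silently delete them:

```python
import statistics

def flag_outliers(values, z=2.0):
    """Split values into (typical, outliers) using a simple z-score
    rule, keeping the outliers so they can be inspected as signals."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(values), []
    typical, outliers = [], []
    for v in values:
        (outliers if abs(v - mean) / stdev > z else typical).append(v)
    return typical, outliers

# e.g. call-handling times in minutes: the 95 may be a process
# failure worth investigating, not a data error to delete.
times = [4, 5, 6, 5, 4, 7, 95, 5, 6]
typical, outliers = flag_outliers(times)
print(outliers)  # → [95]
```

Whether that 95-minute call is a logging glitch or a genuinely broken process is exactly the judgement a blanket cleansing step would have pre-empted.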