The promise of many Big Data projects lies in the ability to decipher and comprehend patterns within vast pools of information: to glean insight from data that is often highly proprietary or sensitive and that may come from one or more sources within or external to the organisation. Hadoop is clearly becoming the de facto platform for managing these types of projects, but the standard Apache Hadoop distribution is not built with much inherent security.

At the heart of Apache Hadoop security is Kerberos, a network authentication protocol that keeps a database of its clients and their private keys. The private key is a large number known only to Kerberos and the client to which it belongs. Where the client is a user, the key is derived from the user's password. Network services requiring authentication register with Kerberos, as do clients wishing to use those services.

Deployment Realities

However, managing a Kerberos key distribution system is a complex task. In some instances, in the haste to get a Hadoop-based Big Data proof of concept up and running, development teams will forgo deploying a full Kerberos system. Although Kerberos works well alongside Linux Pluggable Authentication Modules (PAM), competent admins are still in short supply and the management overhead is often not factored into project plans.

This complexity barrier means that the use of Kerberos in the context of Hadoop can be an issue for large enterprise and public sector IT administrators. As pointed out in a recent client advisory note by the highly respected Evaluator Group, “Login authentication is controlled by a centralised key distribution centre (KDC). Due to the way that Kerberos is architected, a different set of host keys will be required for every node in the Hadoop cluster… adding an additional layer of administrative complexity.”
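To make the scale of that overhead concrete, the following is a minimal Python sketch that lists the service principals a small cluster might need, one per service per node. The realm, host names and service names are hypothetical values chosen for illustration; the `service/host@REALM` form follows the usual Kerberos naming convention, but the exact set of principals depends on the distribution and the services deployed.

```python
from itertools import product

# Hypothetical values for illustration only: substitute your own
# realm, host names and service names.
REALM = "EXAMPLE.COM"
NODES = ["node1.example.com", "node2.example.com", "node3.example.com"]
SERVICES = ["hdfs", "yarn", "HTTP"]


def service_principals(nodes, services, realm):
    """Return one principal name per (node, service) pair."""
    return [
        f"{service}/{node}@{realm}"
        for node, service in product(nodes, services)
    ]


principals = service_principals(NODES, SERVICES, REALM)
for principal in principals:
    print(principal)
print(f"{len(principals)} principals for {len(NODES)} nodes")
```

Because the count grows with the number of nodes multiplied by the number of services, even a modest cluster leaves administrators with many keys to create, distribute and rotate.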

Some enterprise distributions of Hadoop embed the control processes used by Kerberos directly into the management plane. This means that administrators can quickly adhere to baseline security controls, and by integrating with PAM (Pluggable Authentication Modules), access to the cluster can utilise the standard authentication servers and two-factor methods that are prevalent across the wider enterprise security landscape.

In the majority of advanced Hadoop installations, however, clusters are segmented behind highly secure and dedicated networks. For the short term, this practice is likely to continue. Administrators of Hadoop clusters who are progressing into critical production environments should consider a Hadoop distribution that builds security into the operating platform itself rather than adding it as an afterthought.

Best Practice Considerations

One best practice consideration is the option to encrypt data as it travels from an external source and into the cluster. The recent disclosures around state-sponsored spying highlighted the lack of inter-server encryption as a clear weakness in the protection of data in transit.
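As an illustration of what enabling encryption in transit can involve, the Python sketch below generates Hadoop-style configuration fragments. The property names `hadoop.security.authentication`, `hadoop.rpc.protection` and `dfs.encrypt.data.transfer` are standard Apache Hadoop settings, but the split between the two files and the chosen values are assumptions for illustration. A real deployment needs further settings and should follow the documentation for the distribution in use.

```python
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Build a Hadoop-style <configuration> element from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


# Assumed contents of core-site.xml: Kerberos authentication and
# encrypted RPC traffic ("privacy" adds encryption on top of
# authentication and integrity checking).
CORE_SITE = {
    "hadoop.security.authentication": "kerberos",
    "hadoop.rpc.protection": "privacy",
}

# Assumed contents of hdfs-site.xml: encrypt block data transfers.
HDFS_SITE = {
    "dfs.encrypt.data.transfer": "true",
}

core_xml = ET.tostring(build_configuration(CORE_SITE), encoding="unicode")
hdfs_xml = ET.tostring(build_configuration(HDFS_SITE), encoding="unicode")
print(core_xml)
print(hdfs_xml)
```

These settings cover traffic to and within the cluster. Data arriving from an external source may pass through other channels, such as ingestion tools or web endpoints, which need their own transport encryption.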

Another area to consider is the creation of policies that restrict access to authorised users, together with security intelligence to validate and audit those policies. This last stage should be an ongoing process that ties into the wider InfoSec strategy. It is crucial that organisations bring Big Data projects under the same corporate governance and security controls as any other critical system.

The skills used to commission servers and applications and to define access controls do not need to be discarded when deploying Hadoop. However, many organisations can benefit from training to refresh these long-standing skills for the Big Data era.

Considering security and compliance early in the project cycle is an absolute must, and selecting a Hadoop platform that has security built into its heart is equally important. These are the tenets of what we would consider best practice.