If you’re a Hadoop user, you’ll know what a good tool it is for extracting actionable information and analytics from large data sets, both unstructured and structured. It finds information that other systems would miss, trawls through huge volumes of data at speed, and users don’t necessarily need to know in advance what they’re looking for – Hadoop can surface relationships you never knew existed!
Hadoop uses the MapReduce framework to split large data sets into batches and distribute their processing across cluster nodes in a short space of time, and its Hadoop Distributed File System (HDFS) to link those cluster nodes into one, sometimes enormous, file system. So, I hear you ask, where does private cloud storage come into the equation? Well, Hadoop suffers from three major issues – NameNode failure, the amount of storage space it consumes, and data migration – and this is where private cloud storage can help.
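To make the MapReduce idea concrete, here is a minimal single-process sketch of the pattern – the classic word count, with the map and reduce phases written as plain Python functions. This is illustrative only; in a real Hadoop job the framework runs many mapper and reducer tasks in parallel across the cluster nodes, with HDFS holding the input and output.

```python
# Minimal sketch of the MapReduce pattern: word count.
# In Hadoop, map_phase and reduce_phase would run as parallel
# tasks on many cluster nodes; here they run in one process.
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, as a Hadoop mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sum the counts for each key, as a Hadoop reducer would."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data sets", "big unstructured data"]
result = reduce_phase(map_phase(lines))
print(result["big"])   # 2
print(result["data"])  # 2
```

The point is the shape of the computation: the map phase can be applied to each batch of data independently, which is exactly what lets Hadoop fan the work out across a cluster.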
That HDFS is a resilient file system is beyond doubt; more worrying is that its NameNode is a single point of failure that limits Hadoop’s availability. Should the NameNode of an HDFS cluster running HBase (interactive) workloads, real-time extract, transform and load (ETL), or batch-processing workflows suffer an outage, the result is downtime, with a knock-on effect on productivity and usability. Whilst the problem is being worked on and will hopefully be resolved in the release of Hadoop 2.0 later in 2013, private cloud storage providers – such as NetApp (FAS and V-Series), Cleversafe (Dispersed Storage) and EMC (Isilon) – have found a solution to the problem using their storage platforms’ own high-availability capabilities.
Another, bigger issue is that Hadoop keeps at least two or three copies of each block of data. That admittedly makes the solution far more resilient, but it also consumes roughly three times as much storage space, even if you are using cheap server storage! This storage not only takes up floor and rack space, but also uses an exorbitant amount of power and cooling. Cleversafe has provided a solution that eliminates the multiple copies of data by applying its dispersed storage erasure coding behind an HDFS interface. This dispersed storage technology claims an order of magnitude higher resilience than standard HDFS replication, while using as much as 60% less storage space!
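A back-of-the-envelope calculation shows where the storage saving comes from. With 3x replication, every terabyte of data occupies three terabytes of raw capacity; with (k, m) erasure coding, data is cut into k slices plus m parity slices, for an overhead of only (k + m)/k. The (10, 4) parameters below are illustrative assumptions for the sketch, not Cleversafe’s actual scheme.

```python
# Raw storage consumed: HDFS-style 3x replication vs. (k, m) erasure
# coding. The k=10, m=4 values are illustrative, not a vendor's scheme.

def replicated_storage(data_tb, copies=3):
    """Raw capacity needed when every block is stored `copies` times."""
    return data_tb * copies

def erasure_coded_storage(data_tb, k=10, m=4):
    """Raw capacity when data is split into k data slices + m parity slices."""
    return data_tb * (k + m) / k

data = 100  # TB of source data
print(replicated_storage(data))     # 300 TB of raw capacity
print(erasure_coded_storage(data))  # 140 TB of raw capacity
```

Under these assumed parameters the erasure-coded layout uses a little over 50% less raw capacity than triple replication, while still surviving the loss of up to m slices of any stripe – which is the kind of trade-off behind the “as much as 60% less” figure.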
Hadoop, in order to process collected data, first migrates that data into a Hadoop cluster. Whilst this doesn’t take long (depending, obviously, on the size of the data to be analysed and processed!), it is still a step that has to be carried out. Enter EMC Isilon, which has found a way to cut this step out of the process: an Isilon storage cluster can present data written to it over NFS or SMB/CIFS (SMB1 or SMB2) directly as HDFS, thereby eliminating the need to migrate data into a separate Hadoop cluster.
Until the Apache Hadoop project solves these three major usability issues with the release of version 2.0, private cloud storage provides an effective alternative that keeps Hadoop the principal player in analysing big data sets.