Not only have the volume and complexity of data flowing into organisations increased; so has the number of organisations implementing Hadoop and other big data solutions to help them manage this influx of structured and unstructured data. But as the name Hadoop becomes familiar to us all, so do the misconceptions about it.
Philip Russom, industry analyst and director of research for The Data Warehousing Institute (TDWI), presented 12 facts [1] about Hadoop at one of TDWI’s recent Solution Summits: do they ‘burst the bubble’ on the myths surrounding Hadoop in the industry? We’ll leave it up to you to decide…
Fact 1: Hadoop is a library of multiple products. So, you thought it was one single solution? It’s not: it is a range of open source products that are pooled together and managed by the Apache Software Foundation.
Fact 2: Hadoop is open source and is available via proprietary vendors. Yes, it is open source; yes, it can be downloaded for free; yes, it is available via proprietary vendors, such as Cloudera, IBM and EMC Greenplum. Proprietary vendors are able to offer additional features and tools, plus support and maintenance.
Fact 3: Hadoop is not a single product, it is an ecosystem. The products are developed both by the open source community and by vendors, which extends and improves the technology, with vendors providing new products, integrations and platforms.
Fact 4: HDFS is not a database management system, it is a file system. Yes, HDFS can manage data collections, but some specific database management features are missing from it, e.g. query indexes that allow random access to data.
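To make the difference between a file system and an indexed database concrete, here is a minimal sketch in plain Python. It is not HDFS or any real database API: the records, field names and helper functions are all hypothetical, and it only illustrates why a lookup without an index has to read everything.

```python
# Illustrative only: plain Python, not HDFS or a database API.
# The records and field names below are hypothetical.
records = [
    {"id": 101, "customer": "acme", "amount": 250},
    {"id": 102, "customer": "globex", "amount": 75},
    {"id": 103, "customer": "acme", "amount": 40},
]


def scan_lookup(rows, customer):
    """File-system style: read every record and filter as you go."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row["customer"] == customer:
            matches.append(row["id"])
    return matches, examined


def build_index(rows, field):
    """Database style: build an index once, mapping a value to row positions."""
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[field], []).append(position)
    return index


def index_lookup(rows, index, value):
    """Jump straight to the matching rows instead of reading them all."""
    positions = index.get(value, [])
    return [rows[p]["id"] for p in positions], len(positions)


customer_index = build_index(records, "customer")
scan_result = scan_lookup(records, "globex")        # touches all 3 records
index_result = index_lookup(records, customer_index, "globex")  # touches 1
```

Both lookups find the same record; the scan examines every row to get there, while the indexed lookup touches only the match. On three records the difference is trivial, but across very large files it is the gap this fact is pointing at.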
Fact 5: Hive is similar to SQL, but it’s not standard SQL. Hadoop uses Apache Hive and its query language HiveQL, which is similar to SQL but not the same. Because most businesses use SQL-based tools, this raises a compatibility issue. It may prove short-term, but it is nevertheless a barrier to implementing Hadoop as a mainstream solution.
Fact 6: MapReduce and Hadoop are related, but don’t rely on each other. Before HDFS was developed, Google had MapReduce; and there are vendors out there that have products that are similar to MapReduce yet don’t require HDFS. But the two work well together, with most of the value placed with HDFS due to the ability to layer tools over a distributed file system.
Fact 7: MapReduce is the control for analytics, not the analytics. It is an execution engine with a basic MPP (massively parallel processing) architecture. It is good at taking hand-coded logic, running it automatically in parallel, and then reducing the results into one set, making it a powerful tool. MapReduce does not do the analytics itself.
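The split between the engine and the analytics can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop’s MapReduce API: the function names are hypothetical, and a real engine would distribute the map and reduce calls across a cluster. Note that nothing here depends on HDFS either, which echoes Fact 6.

```python
from collections import defaultdict

# Illustrative only: a single-process toy, not Hadoop's MapReduce API.
# All function names are hypothetical.


def map_phase(line):
    """Hand-coded logic: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(key, values):
    """Hand-coded logic: combine all values emitted for one key."""
    return key, sum(values)


def shuffle(pairs):
    """Engine step: group the mapped values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(lines):
    """Engine: apply map, shuffle, then reduce. It knows nothing about words."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


word_counts = run_job(["big data big cluster", "data lake"])
```

The “analytics” (here, counting words) lives entirely in `map_phase` and `reduce_phase`, which someone has to write by hand; `run_job` and `shuffle` only control how that logic is executed and how the results are brought together.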
Fact 8: Hadoop isn’t just about volume of data; it’s also about the diversity of data. Hadoop has earned the tag of being the best technology for managing big data volumes, but it can also handle a diverse range of data, including fully unstructured and semi-structured data; something a lot of data warehouses can’t do.
Fact 9: Hadoop doesn’t replace a data warehouse, but complements one. As data types become more diverse, some people believe data warehouses have seen better days; be cautious about that view. Data warehouses still have a place and carry out the tasks they were designed to do very well; Hadoop has the ability to complement any data warehouse as it becomes what Russom calls ‘an edge system’.
Fact 10: Hadoop isn’t just about web analytics; it can handle other types of analytics, too. Hadoop is capable of handling a wider range of analytics, making it more appealing to a broader range of organisations and pushing it down the mainstream route, although Russom believes its adoption will take a number of years.
Fact 11: Big data doesn’t need Hadoop. Think of big data and you automatically think of Hadoop, but there are other options in the marketplace – Vertica (owned by HP), Teradata and Sybase IQ (owned by SAP) also offer big data solutions. And what about the companies that have been handling their big data long before Hadoop was even developed?
Fact 12: Hadoop isn’t free! Yes, Hadoop is open source; no, it’s not free to deploy! If you want the admin tools and support, which most organisations will need, that costs. Then there’s the coding within the environment which, because Hadoop doesn’t have an optimiser, will need to be done by a professional, plus the hardware costs to install and make operational a Hadoop cluster… no, it’s not free, nor is it cheap!
Hadoop isn’t the answer to all your prayers; it is a system that can help you handle big data but as a part of the bigger picture, not as a single source solution; and this should be taken into consideration when planning business strategies.