What is Hadoop?
A brief introduction to the open source big data software library
The big data systems in use today started life in Google’s laboratories. In 2003, Ghemawat et al. published the Google File System paper, and this inspired Doug Cutting and Mike Cafarella, who were then building the open-source Nutch web search engine, to implement something similar. Google then published another paper called ‘MapReduce: Simplified Data Processing on Large Clusters’ (Dean, 2008). The work initially lived inside the Apache Nutch project, but in January 2006 it was split out into its own project and named Hadoop (after a toy elephant that one of the founders’ children played with at the time).
The Hadoop project initially comprised around 5,000 lines of code for HDFS (the filesystem) and around 6,000 lines for the MapReduce engine. Together, these two pieces of software allowed very large data collections to be stored physically across multiple disks and other media, and processed at large scale.
In April 2006, the project released Hadoop version 0.1.0 as open-source software, which meant anyone could take the code and use it in their own projects. In the same way that the software that allowed the Internet to flourish (HTTP, the Apache web server, PHP, etc.) was open to all, Hadoop gave the world open access to robust and scalable software for big data analysis. Since those early years, Hadoop’s developers have improved the code significantly, and today it is an extensive suite of powerful tools. The base Hadoop framework has the following modules:
Hadoop Common provides the core libraries and utilities required by all the Hadoop applications.
Hadoop Distributed File System (HDFS) is the file system that stores the data and provides very high aggregate bandwidth across the cluster for data processing.
Hadoop YARN is the part of the system that manages resources in the cluster and makes them available to user applications running on Hadoop.
Finally, Hadoop MapReduce is the Hadoop implementation of the MapReduce programming model for large-scale data processing.
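To make the MapReduce model concrete, the sketch below shows a word count written for Hadoop Streaming, a standard Hadoop facility that lets the map and reduce steps be ordinary scripts reading from standard input and writing to standard output. The file names mapper.py and reducer.py and the HDFS paths mentioned afterwards are illustrative only.

# --- mapper.py ----------------------------------------------------------
# Emits one "word<TAB>1" line for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# --- reducer.py ---------------------------------------------------------
# Hadoop Streaming sorts the mapper output by key before the reduce step,
# so identical words arrive on consecutive lines and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

After copying the input into HDFS (for example with hdfs dfs -put books.txt /data/input), the job would be launched with the streaming jar that ships with Hadoop, along the lines of: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/input -output /data/output. The exact jar path depends on the installation, and the /data paths are purely illustrative.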
Hadoop is written in Java (and some C), but many people use other languages to do data analysis. Python is popular because it has excellent libraries aimed explicitly at data scientists (such as pandas, NumPy, SciPy and Matplotlib). Many data scientists do their big data analysis using Spark on top of Hadoop. Spark is an analytics engine written in Scala (which can also be programmed in Java and Python), and is widely regarded as the successor to MapReduce (see https://spark.apache.org). The skills required for managing a Hadoop cluster are not just programming related: knowledge of operating systems, scripting, hardware and security is necessary when dealing with big data systems. Being able to connect to application programming interfaces (APIs) in order to handle streaming data also allows real-time analysis. Data scientists need skills from all these areas, and they have to spend a lot of their time cleaning and preparing data as well as designing the algorithms for data analysis.
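For a sense of what this looks like in practice, here is a minimal PySpark word count; it is a sketch only, the hdfs:// paths are hypothetical, and it assumes a Spark installation that can reach the cluster.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a Hadoop cluster this script would
# normally be launched with spark-submit so that YARN allocates the resources.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/input/books.txt")  # hypothetical input path
         .flatMap(lambda line: line.split())      # split each line into words
         .map(lambda word: (word, 1))             # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)         # sum the counts for each word
)

counts.saveAsTextFile("hdfs:///data/output/word_counts")  # hypothetical output path
spark.stop()

The same few lines run unchanged on a laptop or on a full YARN cluster, which is a large part of Spark’s appeal over hand-written MapReduce jobs.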
Hadoop is powerful software that has made big data accessible to everyone. The project is managed by the Apache Software Foundation, the world’s largest open-source software foundation, and under its stewardship Hadoop has evolved into one of the most widely used big data platforms. More features and companion packages will keep arriving, ensuring that these tools remain central to the evolving world of data science.