What is Hadoop?

A brief introduction to the open source big data software library

The big data systems in use today started life in Google’s laboratories. In 2003, Ghemawat et al. published the Google File System paper, which inspired Doug Cutting and Mike Cafarella to start an open-source implementation of their own. Google then published another paper, ‘MapReduce: Simplified Data Processing on Large Clusters’ (Dean and Ghemawat, 2004). The work initially formed part of the Apache Nutch web-crawler project, but in January 2006 it was spun out under the name Hadoop (named after a toy elephant that Doug Cutting’s son played with at the time).

The Hadoop project initially comprised about 5,000 lines of code for HDFS (the filesystem) and about 6,000 lines for MapReduce. Together, these two pieces of software made it possible to store extensive data collections across many physical disks and other media, and to process that data at large scale.

In April 2006, Hadoop version 0.1.0 was released as open-source software under the Apache umbrella. This meant anyone could take the code and use it in their own projects. In the same way as the software that allowed the Internet to flourish (HTTP, the Apache web server, PHP, etc.), Hadoop gave the world open access to robust and scalable software for big data analysis. Since those early years, Hadoop’s developers have improved the code significantly, and today it is an extensive suite of powerful tools. The base Hadoop framework has the following modules:

  • Hadoop Common comprises the core libraries and utilities required by all the other Hadoop modules.

  • Hadoop Distributed File System (HDFS) is the distributed file system that provides storage and makes massive aggregate bandwidth available for data processing.

  • Hadoop YARN is the part of the system that manages resources in clusters and makes them available to user applications running on Hadoop.

  • Hadoop MapReduce is the Hadoop implementation of the algorithm that performs the data processing (a minimal sketch follows this list).
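
To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets you write the map and reduce steps as plain scripts that read stdin and write stdout. The file names and HDFS paths here are hypothetical, and the exact location of the streaming jar varies by installation.

    #!/usr/bin/env python3
    # mapper.py: emit "word<TAB>1" for every word in the input.
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    #!/usr/bin/env python3
    # reducer.py: input arrives sorted by key, so all counts
    # for a given word are on adjacent lines.
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))

Staged and run on a cluster, the job would look something like this:

    # Stage the input in HDFS, then launch the streaming job.
    hdfs dfs -put books.txt /data/books.txt
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -files mapper.py,reducer.py \
        -input /data/books.txt -output /data/wordcount \
        -mapper mapper.py -reducer reducer.py

The framework handles splitting the input, shuffling and sorting the intermediate keys, and collecting the output, so the scripts themselves never need to know how many machines they run on.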

Hadoop is written in Java (with some C), but many people use other languages for their data analyses. Python is popular because it has excellent libraries aimed explicitly at data scientists (such as pandas, NumPy, SciPy and Matplotlib). Many data scientists do big data analysis using Spark on top of Hadoop. Spark is an analytics engine written in Scala (which can also be programmed from Java and Python), and is the evolution of MapReduce (see https://spark.apache.org). The skills required for managing a Hadoop cluster are not just programming related: knowledge of operating systems, scripting, hardware and security is also necessary when dealing with big data systems. Being able to connect to application programming interfaces (APIs) in order to deal with streaming data also enables real-time analyses. Data scientists need skills from all these areas, and they spend much of their time cleaning and preparing data as well as planning the algorithms for their analyses.
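
To show why Spark is often described as the evolution of MapReduce, here is the same word count as a minimal PySpark sketch; the whole map-and-reduce pipeline fits in a few lines, and the input and output paths are again hypothetical.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session; on a Hadoop cluster this can run on YARN.
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    counts = (spark.sparkContext.textFile("hdfs:///data/books.txt")  # read from HDFS
              .flatMap(lambda line: line.split())   # map: one record per word
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))     # reduce: sum the counts per word

    counts.saveAsTextFile("hdfs:///data/wordcount-spark")
    spark.stop()

Spark keeps intermediate results in memory rather than writing them back to disk between steps, which is the main reason it outperforms classic MapReduce on iterative workloads.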

Hadoop is powerful software that has made big data accessible to everyone. The Apache Software Foundation, the world’s largest open-source software foundation, manages the project, and under its stewardship Hadoop has evolved into one of the most capable suites of software on the planet. More features and software packages will follow, ensuring that these tools remain central to the evolving world of data science.

A Virus in the Information Age

How misinformation made things worse, yet good information prevailed


COVID-19 is unquestionably a challenge: it is a physically difficult illness to deal with, and it triggered a panic like no other in recent history, especially in the media. The proliferation of online news and social media makes it hard to see what is really happening, but scientists across the world have been using information systems to show people the truth, despite governments and leaders trying to hold back information.

The initial days of the virus in 2019 were dark from an information perspective. The Chinese government was not transparent about the information it released, and there were many rumors of police suppressing people who tried to talk about the spread of the disease. Soon it became clear that the information could not be contained, and the effects spread globally.

The next barrier to the truth was social media, where fake news caused panic and social unrest. This made an already bad situation worse, with people panic buying because they had read that there would be shortages of commodities ranging from toilet roll to meat. The effect of this misinformation was to send people out of their homes and into supermarkets with thousands of others, the worst possible outcome when a virus spreads so easily between people in close proximity.

Johns Hopkins University has been working tirelessly to provide an information dashboard to the world that shows the real situation, with none of the claims from politicians impeding the truth. The Center for Systems Science and Engineering (CSSE) comprises researchers who collaborate with the Department of Civil and Systems Engineering (CaSE) at Johns Hopkins University. They claim to be “united by the goal to better understand and improve societal, health, and technological systems for everyone”*. They have been working around the clock to ensure that unbiased information is available, accurate and up-to-date, and they have allowed less biased media outlets such as the BBC to publish accurate and realistic insights into the spread of the disease. While Donald Trump was proclaiming that the “Chinese virus” was barely a threat to his nation, the public could see the progress of the virus and see that his statements were merely poor attempts to hide the truth. The researchers have provided an academic light in these dark times, giving anyone, from a worried citizen to a government, a reliable and trustworthy place to keep track of the situation. They have made the data from their studies available to scientists around the world, and through their hard work they are giving us the ability to see the effects of all the efforts to contain and restrict the spread of the disease.

In an age defined by information, the researchers of Johns Hopkins University have provided a lifeline of truth. Along with the medics, police, supermarket workers and pharmacists who have kept day-to-day life going for us, these scientists are examples of people who are helping immensely. To see their efforts for yourself, visit https://coronavirus.jhu.edu/map.html

* https://systems.jhu.edu/

Research

It’s been a bit quiet here, for a good reason. While I’d typically be actively writing here, I’ve been focusing my attention on publishing in academic journals instead.

I'm putting the finishing touches to a research paper that looks at the issues organisations face with Big Data, and I am currently doing qualitative research in nine multinational organisations.

This is just the start of a much longer-term research project, which I am undertaking as part of the Center for Consumer Driven Growth.

We aim to build a model and methodology for diagnosing organisations and helping them become more customer-centric: identifying product-centric organisations and supporting their transition to customer-centricity.

As a result, I will not be able to publish here as much as I have been doing recently. I will check in with regular updates, and I am sure some of the material I write will not make it into the journals; all of that will be published here.

Thanks to all who read this blog. If you are a professor like me, good luck with the coming exam and evaluation period; it will soon be Christmas, and it will all be over!