Learn about MapReduce and why leading industry giants like Facebook and Google rely on this software framework. Plus, discover where to access this exciting software for free.
MapReduce is a programming model that developers at Google designed as a solution for handling massive amounts of search data. Worldwide, the amount of data we produce has exploded in recent years, with projected data use for 2025 estimated to be over 180 zettabytes [1]. With the rise of big data, it has become increasingly important for people to have access to easy, fast ways of breaking down and processing this information. Since its creation, MapReduce has become a cornerstone in the field of big data.
In this article, we will cover the fundamental details you should know about MapReduce, including what it is, its architecture, advantages and limitations, and more.
In the past, computing was typically performed on a single machine, or node. However, as computers have become more widespread and capable of performing computations independently, it has become increasingly important for programming systems to use parallel computing across many machines for more efficient data analysis and processing. This is where MapReduce comes in.
MapReduce is a programming model designed for big data. It breaks large data-processing tasks into smaller chunks and processes those chunks in parallel, which makes it significantly easier and faster to work through large sets of data. This software framework is also fault-tolerant, meaning it can maintain reliable operations and output even when interrupted mid-process.
You can think of MapReduce as a way to “divide and conquer” big data. Imagine you have 1,000 CDs that you want to categorize into genres. Doing so would take a considerable amount of time to do by yourself. However, if you invited 10 people over, each of you could go through and categorize 100 CDs into their genres. After you have each categorized your CDs, you could combine your groups by genre so that you have categories assigned for all of the 1,000 CDs. This is essentially parallel processing when you all work on your individual portion simultaneously. When you combine them, you are aggregating the data into one place.
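The CD analogy above can be sketched in a few lines of Python. This is a hedged illustration, not Hadoop itself: the `GENRES` lookup table, the album titles, and the function names are all hypothetical, and it uses threads only to show the split-work-then-merge pattern.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Hypothetical genre lookup standing in for "categorizing a CD" by hand.
GENRES = {
    "Abbey Road": "rock",
    "Kind of Blue": "jazz",
    "Thriller": "pop",
    "Blue Train": "jazz",
}

def categorize(chunk):
    """Each worker tallies genres for its own slice of the collection."""
    return Counter(GENRES.get(cd, "unknown") for cd in chunk)

def categorize_all(cds, workers=2):
    # Parallel phase: split the collection into one chunk per worker.
    size = max(1, len(cds) // workers)
    chunks = [cds[i:i + size] for i in range(0, len(cds), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(categorize, chunks)
    # Aggregation phase: merge the partial tallies into one result.
    total = Counter()
    for partial in partials:
        total += partial
    return total
```

Each worker only ever sees its own chunk, just as each friend only sorts their own stack of CDs; the final merge is the aggregation step.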
The MapReduce software framework first maps data into a set of <key, value> pairs and then reduces all the pairs with the same key identifiers. The map function starts the process of sorting through the data and organizing it into understandable pieces. Then, the reduce function takes over, summarizing this information into useful results. Together, they make processing large data sets manageable and efficient.
The map in MapReduce acts like a filtering process. It takes a big problem, breaks it down, and different maps work on each piece individually. For example, if you had a list of names and wanted to count the occurrence of each name, the map function would go through the list and create a key-value pair for each name. In the previous example, where you sorted 1,000 CDs, each of the 10 people is a “map.” Usually, the Hadoop framework automatically decides the number of maps based on the size of the input data and how it splits into blocks, which often means one map per input split. However, you can also suggest a number of map tasks yourself.
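The map step for the name-counting example above can be sketched in Python. This is a simplified, single-machine illustration with a made-up function name; it only emits the <key, value> pairs, leaving the counting to the reduce step.

```python
def map_names(names):
    """Map step: emit a <name, 1> pair for every occurrence of a name."""
    return [(name, 1) for name in names]

pairs = map_names(["Ada", "Grace", "Ada"])
# pairs is now [("Ada", 1), ("Grace", 1), ("Ada", 1)]
```

Note that the map step does no summarizing at all; duplicate names simply produce duplicate pairs for a later step to combine.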
The reduce function comes in after the map function has done its job. First, the outputs from the map function are “shuffled” into an organization that the reduce function can condense more easily: pairs with the same key, even those produced by different map functions, are routed to the same reducer. The reducer then combines the output from the map functions into a summarized result.
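The shuffle and reduce steps described above can be sketched as follows. This is a hedged, single-machine illustration with hypothetical function names: `shuffle` groups values by key, and `reduce_counts` summarizes each group.

```python
from collections import defaultdict

def shuffle(pairs):
    """Shuffle step: route every pair with the same key into one group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce step: summarize each group of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_counts(shuffle([("Ada", 1), ("Grace", 1), ("Ada", 1)]))
# counts == {"Ada": 2, "Grace": 1}
```

In a real cluster, the shuffle moves data across the network so each reducer receives all values for its assigned keys; here a dictionary plays that role.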
Let’s say you have a list of words that reads [yellow yellow red blue yellow red blue blue blue yellow]. The map function would break this list into:
<yellow, 1>
<yellow, 1>
<red, 1>
<blue, 1>
<yellow, 1>
<red, 1>
<blue, 1>
<blue, 1>
<blue, 1>
<yellow, 1>
The reduce function would then reduce this list to read:
<yellow, 4>
<red, 2>
<blue, 4>
This is an oversimplified example of what MapReduce does. Still, it gives you a general idea of how this function takes a large volume of information and reduces it to a more organized, condensed format. In practice, MapReduce may have several steps of reducing and shuffling values to produce an output as efficiently as possible.
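The whole color-counting example can be run end to end in a few lines of Python. This is a simplified, single-machine sketch of the same map, shuffle, and reduce steps, not a distributed implementation.

```python
words = "yellow yellow red blue yellow red blue blue blue yellow".split()

# Map: emit one <word, 1> pair per occurrence.
pairs = [(word, 1) for word in words]

# Shuffle: group all values that share a key.
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)

# Reduce: sum each group's values into a single total.
counts = {key: sum(values) for key, values in grouped.items()}
# counts == {"yellow": 4, "red": 2, "blue": 4}
```

Swapping `sum` for another combining function (a maximum, an average, a set union) turns the same skeleton into a different MapReduce job.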
Professionals across many industries use MapReduce, but it was popularized through its use at industry giants such as Google, Yahoo, and Facebook. Professionals at Google have used MapReduce to cluster articles for Google News, build the index behind Google Search, compute PageRank, generate street map data, and support machine translation. Yahoo has used MapReduce to build its web search map and detect spam, while Facebook has used MapReduce for data mining and analytics, spam detection, and advertisement strategy.
Professionals in other industries use MapReduce for tasks such as:
Validating data accuracy in finance to avoid fraud and illegal activity
Archiving and storing medical claims data in health care
Storing and structuring genomic and health data for ongoing research projects
Tracking and processing consumer data such as retail purchases and marketing clicks
By using MapReduce, you may experience several advantages. While these advantages will depend on your use case and needs, users generally appreciate MapReduce for the following features.
Elastic flexibility: It can scale quickly and reduce computation times by rearranging nodes for parallel processing.
Fault tolerance: Even if one node fails, the system continues functioning due to node replication.
Load balancing: The framework can distribute work across nodes to manage resources effectively.
Cost efficiency: It uses commodity machines and networks.
As with any software, it’s important to be aware of MapReduce’s limitations.
Latency: MapReduce may be slower than some other programs, which is a drawback for applications requiring immediate results.
Unreliability: Commodity machines, while cheaper, may not be as reliable as private machines.
The most popular open-source implementation of MapReduce is the Apache Hadoop framework. Hadoop distributes your data and computation across the commodity servers in a Hadoop cluster, so you can analyze large data sets with this programming model.
To use Hadoop MapReduce, you will first need to make sure you have Hadoop installed and configured. Once done, you can create your first cluster on a managed platform such as Azure HDInsight. Steps will differ depending on your configuration setup and whether you want to develop Java MapReduce programs or use another language, so it’s best to follow a guided tutorial specific to your setup.
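If you prefer Python to Java, Hadoop Streaming lets you write the mapper and reducer as ordinary scripts that read standard input and write standard output. Below is a hedged word-count sketch in that style; the tab-separated `word<TAB>1` record format and the assumption that the reducer receives its input sorted by key follow the streaming convention, but the exact job-submission command depends on your Hadoop installation.

```python
import sys

def mapper(lines):
    """Map phase: emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Reduce phase: records arrive sorted by key, so each run of
    identical keys can be totaled in a single pass."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Invoke with "map" or "reduce" as the first argument, e.g. so one
    # script can serve as both -mapper and -reducer in a streaming job.
    stage = mapper if sys.argv[1:] == ["map"] else reducer
    sys.stdout.writelines(record + "\n" for record in stage(sys.stdin))
```

Between the two phases, Hadoop itself performs the shuffle and sort, which is why the reducer can rely on seeing all records for a given word consecutively.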
You can build your development and analysis skills with exciting courses of all levels on the Coursera learning platform. As a beginner, consider the Cloud Applications Development Foundations Specialization to learn about development environments and data structures. Throughout this beginner-friendly, five-course series, you'll learn about the fundamentals of cloud computing, how to develop applications in the front-end and back-end, and more.
Statista. “Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025,” https://www.statista.com/statistics/871513/worldwide-data-created/. Accessed March 22, 2024.
Editorial Team
Coursera’s editorial team is comprised of highly experienced professional editors, writers, and fact...
This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.