Understanding MapReduce | Towards Data Science


A deep dive into MapReduce and parallelization

Giorgos Myrianthous

Towards Data Science


In the current market landscape, organizations must engage in data-driven decision-making to maintain competitiveness and foster innovation. As a result, an immense amount of data is collected on a daily basis.

Although the challenge of data persistence has largely been resolved, thanks to the widespread availability and affordability of cloud storage, modern organizations continue to grapple with the efficient and effective processing of massive amounts of data.

Over the past few decades, numerous programming models have emerged to address the challenge of processing big data at scale. Undoubtedly, MapReduce stands out as one of the most popular and effective approaches.

What is MapReduce?

MapReduce is a distributed programming framework originally developed at Google by Jeffrey Dean and Sanjay Ghemawat in 2004, inspired by fundamental concepts of functional programming. Their proposal involved a parallel data processing model consisting of two steps: map and reduce.

In simple terms, the map step involves dividing the original data into small chunks so that transformation logic can be applied to individual data blocks. Processing can therefore run in parallel across the chunks, and the reduce step then aggregates/consolidates the processed blocks and returns the end result to the caller.
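The two-step idea can be sketched in a few lines of Python. This is a minimal, illustrative word-count example (the chunks, and the choice of `Counter` as the per-chunk transformation, are assumptions for demonstration, not part of the original framework):

```python
from collections import Counter
from functools import reduce

# Hypothetical input already split into small chunks.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

# Map step: transform each chunk independently.
# In a real cluster, each of these calls could run on a different node.
mapped = [Counter(chunk.split()) for chunk in chunks]

# Reduce step: consolidate the per-chunk results into a single answer.
total = reduce(lambda a, b: a + b, mapped)

print(total["the"])  # 3
```

Because each chunk is processed independently in the map step, the work parallelizes naturally; only the final reduce needs to see all intermediate results.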

How does the MapReduce algorithm work?

Even though the MapReduce algorithm is widely known as a two-step process, it actually involves three distinct stages.

1. Map: In this very first step, the data is split into smaller chunks and distributed across multiple nodes, usually part of a cluster of processing units. Each chunk is then assigned to a mapper. The input to the mapper is a set of <key, value> pairs. Once the processing is executed on the data (which is once again in the form of <key, value> pairs), the mapper writes the resulting output to temporary storage.
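A mapper for the classic word-count task might look like the sketch below. The signature and the `(chunk_id, text)` input convention are assumptions chosen to mirror the <key, value> description above, not a specific framework's API:

```python
# Hypothetical mapper: receives a (key, value) pair, where key is a chunk
# identifier and value is the chunk's text, and emits intermediate
# (word, 1) pairs for a downstream reduce stage to aggregate.
def mapper(key, value):
    for word in value.split():
        yield (word, 1)

# Example invocation on a single chunk:
intermediate = list(mapper(0, "the quick brown fox the"))
# [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('the', 1)]
```

Emitting `(word, 1)` rather than pre-counted totals keeps each mapper simple and stateless; the aggregation is deferred to later stages.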

As an example, consider the case where the input text is first split across three…


