Map, Filter, and CombinePerKey Transforms in Writing Apache Beam Pipelines with Examples | by Rashida Nasrin Sucky | Jul, 2023

Let’s Practice with Some Real Data


Apache Beam is gaining popularity as a unified programming model for efficient and portable big data processing pipelines. It can handle both batch and streaming data, which is where the name comes from. Beam is a combination of the words Batch and Stream:

B (from Batch) + eam (from strEAM) = Beam

Portability is another great feature. You only need to focus on writing the pipeline; it can then run anywhere, such as on Spark, Flink, Apex, or Cloud Dataflow, without any change to the logic or syntax.

In this article, we will focus on learning to write some ETL pipelines using examples. We will try some transform operations on a real dataset, and hopefully you will find these transform operations useful in your own work as well.

Please feel free to download this public dataset and follow along:

Sample Sales Data | Kaggle

A Google Colab notebook is used for this exercise, so installation is very easy. Just use this line of code:

!pip install --quiet apache_beam

After installation is done, I made a directory for this exercise named ‘data’:

!mkdir -p data

Let’s dive into today’s topic: the transform operations. To start, we will build the simplest possible pipeline, one that just reads the CSV file and writes it to a text file.

This is not as simple as Pandas’ read_csv() method. It requires a coder operation. First, a CustomCoder() class is defined that encodes objects into byte strings, decodes the bytes back into the corresponding objects, and specifies whether this coder is guaranteed to encode values deterministically. Please check the Apache Beam coder documentation for details.

If this is your first pipeline, please note the syntax. After the CustomCoder() class comes the simplest pipeline. We first initiate an empty pipeline as ‘p1’. Then we write the ‘sales’ pipeline, which first reads the CSV file from the data folder we created earlier. In Apache Beam, each transform operation in the…
