Apache Beam is gaining popularity as a unified programming model for efficient and portable big data processing pipelines. It can handle both batch and streaming data, which is where the name comes from. Beam is a combination of the words Batch and Stream:
B (from Batch) + eam (from strEAM) = Beam
Portability is another great feature. You only need to focus on writing the pipeline; the same pipeline can then run on any supported runner, such as Spark, Flink, Apex, or Cloud Dataflow, without changing the logic or syntax.
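For example, switching runners is just a matter of pipeline options. Here is a minimal sketch (not from the original article) that runs locally on the DirectRunner; pointing the same code at Dataflow, Flink, or Spark only requires changing the runner option (plus the runner-specific settings), not the pipeline logic:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pick a runner via options; "DirectRunner" executes the pipeline locally.
# Swap in "DataflowRunner", "FlinkRunner", or "SparkRunner" without touching the pipeline code.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "beam"])
     | "Print" >> beam.Map(print))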
In this article, we will focus on writing some ETL pipelines through examples. We will try several transform operations on a real dataset, and hopefully you will find these transform operations useful in your own work as well.
Please feel free to download this public dataset and follow along:
A Google Colab notebook is used for this exercise, so installation is very easy. Just use this line of code:
!pip install --quiet apache_beam
After the installation is done, I made a directory for this exercise named ‘data’:
!mkdir -p data
Let’s dive into today’s topic: the transform operations. To start with, we will work on the simplest pipeline, which just reads the CSV file and writes it to a text file.
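The original code snippet is not reproduced here, so below is only a rough sketch of what this first pipeline could look like. The file paths data/sales.csv and data/output and the step labels are placeholder assumptions, not values from the article:

import apache_beam as beam
from apache_beam.coders import coders

class CustomCoder(coders.Coder):
    """Encodes each element to a UTF-8 byte string and decodes it back."""

    def encode(self, value):
        # Convert the element into bytes so Beam can serialize it.
        return value.encode("utf-8", "replace")

    def decode(self, value):
        # Convert the bytes back into the corresponding string element.
        return value.decode("utf-8", "ignore")

    def is_deterministic(self):
        # The same value always encodes to the same bytes.
        return True

# The simplest pipeline: read the CSV file and write it straight back out as text.
p1 = beam.Pipeline()

sales = (
    p1
    | "Read CSV" >> beam.io.ReadFromText("data/sales.csv", coder=CustomCoder())
    | "Write to text" >> beam.io.WriteToText("data/output")
)

p1.run()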
This is not as simple as the Pandas read_csv() method; it requires a coder operation. A CustomCoder() class is defined first, which encodes the objects into byte strings, decodes the bytes back into their corresponding objects, and finally specifies whether this coder is guaranteed to encode values deterministically. Please check the documentation here.
If this is your first pipeline, please notice the syntax. After the CustomCoder() class comes the simplest pipeline. We initiate the empty pipeline as ‘p1’ first. Then we write the ‘sales’ pipeline, which first reads the CSV file from the data folder we created earlier. In Apache Beam, each transform operation in the…