The aim of this post is to explain and show you how to build data pipelines with Polars. It builds on all the knowledge you’ve gained from the previous two parts of this series, so if you haven’t gone through them yet, I highly recommend starting there and coming back here later.
You can find all the code in this repository, so don’t forget to clone/pull and star it. In particular, we’ll be exploring this file, which means that we’ll finally move away from notebooks into the real world!
Data used in this project can be downloaded from Kaggle (CC0: Public Domain). It’s the same YouTube trending dataset that was used in the previous two parts. I assume that you already have Polars installed, so just make sure to update it to the latest version using pip install -U polars.
Put simply, a data pipeline is an automated sequence of steps that pulls data from one or more locations, applies processing steps, and saves the processed data elsewhere, making it available for further use.
Pipelines in Polars
Polars’ way of working with data lends itself quite nicely to building scalable data pipelines. First of all, the ease with which we can chain methods allows some fairly complicated pipelines to be written quite elegantly.
For example, let’s say we want to find out which trending videos had the most views in each month of 2018. Below you can see a full pipeline that calculates this metric and saves the result as a parquet file.
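Here is a minimal sketch of what such a pipeline could look like. The input file name, the date format and the exact column names (trending_date, title, views) are assumptions based on the Kaggle trending dataset; the actual script in the repository may differ.

```python
import polars as pl

# Minimal sketch: most-viewed trending video per month of 2018.
# File name, date format and column names are assumptions.
top_videos_2018 = (
    pl.scan_csv("youtube_trending.csv")  # lazy scan: nothing is read yet
    .with_columns(
        pl.col("trending_date").str.strptime(pl.Date, "%y.%d.%m")
    )
    .filter(pl.col("trending_date").dt.year() == 2018)  # keep 2018 only
    .with_columns(pl.col("trending_date").dt.month().alias("month"))
    .group_by("month")
    .agg(
        pl.col("views").max().alias("max_views"),
        # title of the most-viewed trending video in each month
        pl.col("title").sort_by("views").last().alias("top_video"),
    )
    .sort("month")
)

# Execute the lazy query and persist the result as a parquet file.
top_videos_2018.collect().write_parquet("top_videos_2018.parquet")
```

Because everything up to collect() is lazy, Polars can optimise the whole chain (for example, only reading the columns it actually needs) before any data is loaded.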