Getting Started with Great Expectations: A Guide to Data Validation in Python | by Eugenia Anello | Jul, 2023


Learn how to prevent data quality issues with a few lines of code in Python

Eugenia Anello

Towards Data Science

Photo by Link Hoang on Unsplash

When you work on a data science project for a company, you usually don’t have a single fixed test set, as you would at university or in research. Instead, you keep receiving newly updated samples from the client.

Before applying the machine learning model to a new sample, you need to verify its data quality: the column names, the column types, and the distributions of the fields should match those of the training set and the earlier test set.
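To make these three checks concrete, here is a minimal sketch that uses only the Python standard library. It is not the Great Expectations API. The helper names (`infer_schema`, `compare_to_reference`), the `tolerance` threshold, and the toy rows are all assumptions made for illustration, and comparing means is only a crude stand-in for a real distribution check.

```python
from statistics import mean

NUMERIC_TYPES = {"int", "float"}


def infer_schema(rows):
    """Map each column name to the set of Python type names observed in it."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, set()).add(type(value).__name__)
    return schema


def compare_to_reference(reference_rows, new_rows, numeric_column, tolerance=0.5):
    """Return a list of problems found when comparing new_rows to reference_rows."""
    problems = []
    ref_schema = infer_schema(reference_rows)
    new_schema = infer_schema(new_rows)

    # 1. Column names: report anything missing or unexpected.
    for column in sorted(set(ref_schema) - set(new_schema)):
        problems.append(f"missing column: {column}")
    for column in sorted(set(new_schema) - set(ref_schema)):
        problems.append(f"unexpected column: {column}")

    # 2. Column types: the observed types must match for shared columns.
    for column in sorted(set(ref_schema) & set(new_schema)):
        if ref_schema[column] != new_schema[column]:
            problems.append(
                f"type mismatch in {column}: expected "
                f"{sorted(ref_schema[column])}, got {sorted(new_schema[column])}"
            )

    # 3. Distribution (crude): flag a large relative shift of the mean.
    ref_types = ref_schema.get(numeric_column, set())
    new_types = new_schema.get(numeric_column, set())
    if ref_types and new_types and (ref_types | new_types) <= NUMERIC_TYPES:
        ref_mean = mean(r[numeric_column] for r in reference_rows if numeric_column in r)
        new_mean = mean(r[numeric_column] for r in new_rows if numeric_column in r)
        if ref_mean != 0 and abs(new_mean - ref_mean) / abs(ref_mean) > tolerance:
            problems.append(
                f"mean of {numeric_column} shifted from {ref_mean} to {new_mean}"
            )
    return problems


# Toy rows invented for this sketch; they are not taken from any real dataset.
train_rows = [
    {"price": 100.0, "number_of_reviews": 10},
    {"price": 150.0, "number_of_reviews": 20},
]
new_rows = [
    {"price": "120", "number_of_reviews": 90},   # price arrives as a string
    {"price": "80", "number_of_reviews": 110},
]

problems = compare_to_reference(train_rows, new_rows, "number_of_reviews")
for problem in problems:
    print(problem)
```

Even this small version needs a fair amount of code for only three checks, and it would have to grow with every new rule.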

Manually analyzing the data can be time-consuming when the data is dirty and has more than 100 features. Luckily, there is a life-saving Python library called Great Expectations. Did I intrigue you? Let’s get started!

What is Great Expectations?

Illustration by Author. Source: flaticon.

Great Expectations is an open-source Python library that focuses on three important aspects of managing data:

  • validating data by checking that it meets certain conditions, called expectations
  • automating data profiling, so you can test your data quickly without starting from scratch
  • generating formatted documentation that contains the results of the expectations and validations.

In this tutorial, we are going to focus on validating data, which is one of the main issues when dealing with real-world data.

Airbnb listings in Amsterdam

We are going to analyze the Airbnb listings provided by Inside Airbnb, working with the data from Amsterdam. The dataset is already split into training and test sets. As you may guess from the name of the dataset, the goal is to predict listing prices. If we focus only on the number of reviews, we can see that it varies more in the test data than in the training set.
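One simple way to quantify this kind of difference in variability is to compare the standard deviations of the two sets and to count how many test values fall outside the range seen during training. The sketch below uses only the standard library. The function name `spread_report` is hypothetical, and the review counts are toy values invented for illustration, not figures from the Inside Airbnb data.

```python
from statistics import stdev


def spread_report(train_values, test_values):
    """Compare the spread of test_values against train_values."""
    low, high = min(train_values), max(train_values)
    # Test values that the model never saw anything like during training.
    outside = [v for v in test_values if v < low or v > high]
    return {
        "train_stdev": stdev(train_values),
        "test_stdev": stdev(test_values),
        "train_range": (low, high),
        "share_outside_train_range": len(outside) / len(test_values),
    }


# Toy review counts, invented for this sketch.
train_reviews = [0, 2, 5, 8, 10, 12]
test_reviews = [0, 1, 4, 30, 150, 400]

report = spread_report(train_reviews, test_reviews)
print(report)
```

A large share of test values outside the training range is exactly the kind of condition that is worth writing down as an explicit expectation on the column.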


