When you work on a data science project for a company, you usually don’t have a single fixed test set, as you would in university or research; instead, you keep receiving newly updated samples from the client.
Before applying the machine learning model to a new sample, you need to verify its data quality: the column names, the column types, and the distributions of the fields should all match those of the training set and the previous test sets.
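To make this concrete, here is a minimal plain-Python sketch of such a manual check, comparing the column names and types of a new batch against a reference schema. The column names and types used here are hypothetical, just for illustration:

```python
# Minimal sketch of a manual schema check between a reference (training)
# schema and a newly received batch. Column names and types are hypothetical.

def check_schema(reference_schema, new_schema):
    """Return a list of human-readable schema mismatches."""
    problems = []
    missing = set(reference_schema) - set(new_schema)
    extra = set(new_schema) - set(reference_schema)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    # For columns present in both, compare the declared types
    for col, expected_type in reference_schema.items():
        actual = new_schema.get(col)
        if actual is not None and actual != expected_type:
            problems.append(f"{col}: expected {expected_type}, got {actual}")
    return problems

training_schema = {"price": "float", "number_of_reviews": "int"}
new_batch_schema = {"price": "str", "number_of_reviews": "int"}
print(check_schema(training_schema, new_batch_schema))
# → ['price: expected float, got str']
```

This works for a handful of columns, but hand-rolling and maintaining such checks for 100+ features is exactly the pain the library below removes.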
Manually analyzing the data can be time-consuming when it is dirty and contains more than 100 features. Luckily, there is a life-saving Python library called Great Expectations. Did I intrigue you? Let’s get started!
What is Great Expectations?
Great Expectations is an open-source Python library that specializes in three important aspects of managing data:
- validating data by verifying that it respects some important conditions, or expectations
- automating data profiling, so you can test your data quickly without starting from scratch
- generating formatted documents that contain the results of the expectations and validations
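To give a flavor of the expectation style before we touch the library itself, here is a plain-Python sketch of the core idea: each check returns a result with a success flag plus details about the offending values. The helper below is hypothetical and is not the Great Expectations API; it only mimics the shape of an expectation result:

```python
# Plain-Python sketch of the "expectation" idea: a check that returns a
# result dict with a success flag and details, similar in spirit to what
# Great Expectations produces. This helper is hypothetical, not the library API.

def expect_values_between(values, min_value, max_value):
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected[:5],  # small sample of offenders
    }

reviews = [0, 3, 12, 250, 7]
result = expect_values_between(reviews, 0, 100)
print(result["success"], result["unexpected_values"])
# → False [250]
```

The key design point, which the library shares, is that a failed check does not raise an exception: it reports what failed and how often, so you can review all data-quality issues in one pass.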
In this tutorial, we are going to focus on validating data, which is one of the main issues when dealing with real-world data.
Airbnb listings in Amsterdam
We are going to analyze the Airbnb listings provided by Inside Airbnb, working with data from Amsterdam. The dataset is already split into training and test sets. As you may guess from the name of the dataset, the goal is to predict listing prices. If we pay attention just to the number of reviews, we can notice that the review counts in the test data have more variability than those in the training set.
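One quick way to see what “more variability” means is to compare the standard deviations of the review counts in the two splits. The numbers below are made up for illustration and are not the actual Inside Airbnb values:

```python
import statistics

# Hypothetical review counts, only to illustrate the variability comparison;
# these are NOT the real Inside Airbnb numbers.
train_reviews = [2, 5, 8, 10, 12]
test_reviews = [0, 1, 30, 90, 200]

# A larger standard deviation on the test split means its review counts
# are more spread out than the training split's.
print(statistics.pstdev(train_reviews))
print(statistics.pstdev(test_reviews))
```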