The Pandas library has a variety of built-in methods you can use to process and clean data to make it ready for analysis and machine learning.
As you work with different kinds of data, you’ll often find the need to remove entire rows based on a condition or update part of a string value as part of your data cleaning. You might also want to create new columns from existing ones as a part of your feature engineering process.
Pandas lets you perform a variety of operations on object and string data types with its native transformation methods. In this piece, let’s specifically take a look at how you can replace entire values and/or substrings in the columns of your DataFrames.
Feel free to follow along with the examples in this piece in a notebook! You can download the dataset from Kaggle, where it is available free of charge under the Open Data Commons Public Domain Dedication and License (PDDL) v1.0. Then import Pandas and load the CSV as shown here, and we can get started!
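import pandas as pd
# Load the raw dataset (the CSV is assumed to sit next to your notebook)
df_raw = pd.read_csv("Top-Largest-Universities.csv")
# Work on a copy so the raw data stays untouched; the examples use df
df = df_raw.copy()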
Using “replace” in Pandas to edit substring values in a DataFrame Series (Column)
Let’s say we want to look at the values in the “Continent” column specifically. We can use the value_counts method in Pandas, which essentially does a group by on the specified column and returns the number of rows for each unique value in that column. This is useful for seeing how many times each unique value appears in the DataFrame.
df.value_counts("Continent")
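If you don’t have the dataset on hand, here is a minimal sketch of what value_counts does, using a small made-up DataFrame. The `sample` frame and its values are hypothetical stand-ins, not rows from the Kaggle file. It also shows the roughly equivalent group by, to make the comparison above concrete.

```python
import pandas as pd

# Hypothetical stand-in data; the real CSV's contents will differ.
sample = pd.DataFrame({
    "Institution": ["A", "B", "C", "D", "E"],
    "Continent": ["Asia", "Asia", "Europe", "Asia", "Europe"],
})

# Count how many rows share each unique "Continent" value.
# Results are sorted from most to least frequent by default.
counts = sample.value_counts("Continent")
print(counts)

# Roughly the same result written as an explicit group by.
grouped = sample.groupby("Continent").size().sort_values(ascending=False)
print(grouped)
```

Calling `sample["Continent"].value_counts()` on the column itself gives the same counts, so which form you use is mostly a matter of preference.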