ChatGPT Code Interpreter: How It Saved Me Hours of Work | by Soner Yıldırım | Jul, 2023

Data cleaning and preprocessing

The next step after reading the data is to clean and preprocess it. You have probably heard many times that this step usually takes the most time.

There are a few issues that need to be handled before creating the world map visualizations. ChatGPT detected them and provided a solution in less than a minute. Quite impressive!

(image by author)

The code in the above snippet:

# code generated by Code Interpreter
# Check if the country names are consistent across both dataframes
population_countries = set(population_data['Country Name'])
coordinates_countries = set(country_coordinates['name'])

# Countries in the population data but not in the coordinates data
missing_in_coordinates = population_countries - coordinates_countries

# Countries in the coordinates data but not in the population data
missing_in_population = coordinates_countries - population_countries

There are some countries missing in the coordinates dataset and also some in the population dataset.

What ChatGPT did in the code snippet above:

  • Create Python sets of the country names in each dataset
  • Take the set difference in both directions to find the names that appear in one dataset but not in the other
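
To see what these two set differences return, here is a minimal sketch on made-up data. The two tiny DataFrames and their contents are illustrative assumptions; only the column names 'Country Name' and 'name' come from the snippet above.

```python
import pandas as pd

# Toy stand-ins for the real datasets (rows are made up for illustration)
population_data = pd.DataFrame(
    {"Country Name": ["Turkiye", "Egypt, Arab Rep.", "France", "World"]}
)
country_coordinates = pd.DataFrame({"name": ["Turkey", "Egypt", "France"]})

population_countries = set(population_data["Country Name"])
coordinates_countries = set(country_coordinates["name"])

# Names that appear only in the population data
missing_in_coordinates = population_countries - coordinates_countries

# Names that appear only in the coordinates data
missing_in_population = coordinates_countries - population_countries

print(sorted(missing_in_coordinates))
print(sorted(missing_in_population))
```

In this toy case, "Turkiye" and "Turkey" are the same country under two spellings, which is the kind of mismatch a name mapping can fix, while a row such as "World" has no counterpart at all and would have to be dropped.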

ChatGPT solved it in less than a minute:

The code in the above snippet:

# code generated by Code Interpreter
import pandas as pd  # needed for pd.merge below

# Mapping of country names from the population dataset to the coordinates dataset
country_name_mapping = {
'Bahamas, The': 'Bahamas',
'Brunei Darussalam': 'Brunei',
'Cabo Verde': 'Cape Verde',
'Congo, Dem. Rep.': 'Congo [DRC]',
'Congo, Rep.': 'Congo [Republic]',
"Cote d'Ivoire": "Côte d'Ivoire",
'Czechia': 'Czech Republic',
'Egypt, Arab Rep.': 'Egypt',
'Gambia, The': 'Gambia',
'Hong Kong SAR, China': 'Hong Kong',
'Iran, Islamic Rep.': 'Iran',
"Korea, Dem. People's Rep.": 'North Korea',
'Korea, Rep.': 'South Korea',
'Kyrgyz Republic': 'Kyrgyzstan',
'Lao PDR': 'Laos',
'Macao SAR, China': 'Macau',
'Micronesia, Fed. Sts.': 'Micronesia',
'Myanmar': 'Myanmar [Burma]',
'North Macedonia': 'Macedonia [FYROM]',
'Russian Federation': 'Russia',
'Sao Tome and Principe': 'São Tomé and Príncipe',
'Slovak Republic': 'Slovakia',
'Syrian Arab Republic': 'Syria',
'Turkiye': 'Turkey',
'Venezuela, RB': 'Venezuela',
'Yemen, Rep.': 'Yemen',
'Eswatini': 'Swaziland'
}

# Clean the population dataset
population_data_clean = population_data.replace({"Country Name": country_name_mapping})

# Drop the rows that do not exist in the coordinates dataset
population_data_clean = population_data_clean[population_data_clean['Country Name'].isin(coordinates_countries)]

# Merge the population data with the country coordinates data
merged_data = pd.merge(population_data_clean, country_coordinates, left_on='Country Name', right_on='name')

# Keep only the necessary columns
merged_data = merged_data[['Country Name', '2022', 'latitude', 'longitude']]

# Rename the columns for clarity
merged_data.columns = ['Country', 'Population', 'Latitude', 'Longitude']


Cleaned and preprocessed data (image by author)

We now have the data in a nice and clean format. This step would normally take a lot of time and manual effort. ChatGPT did it in about a minute.
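
One way to double-check a cleaning step like this is to run the same sequence on a few hand-written rows where the expected result is obvious. The sketch below does that. The sample rows, population figures, and coordinates are invented placeholders, not values from the datasets used in this article, and the one-entry mapping is a cut-down version of the full dictionary.

```python
import pandas as pd

# Hand-written sample rows (placeholder values, not real statistics)
population_data = pd.DataFrame({
    "Country Name": ["Turkiye", "France", "World"],
    "2022": [100, 200, 1000],
})
country_coordinates = pd.DataFrame({
    "name": ["Turkey", "France"],
    "latitude": [39.0, 46.0],
    "longitude": [35.0, 2.0],
})

country_name_mapping = {"Turkiye": "Turkey"}

# Rename countries so both datasets use the same spelling
cleaned = population_data.replace({"Country Name": country_name_mapping})

# Keep only rows whose country exists in the coordinates data
cleaned = cleaned[cleaned["Country Name"].isin(set(country_coordinates["name"]))]

# Merge, then select and rename the columns
merged = pd.merge(cleaned, country_coordinates, left_on="Country Name", right_on="name")
merged = merged[["Country Name", "2022", "latitude", "longitude"]]
merged.columns = ["Country", "Population", "Latitude", "Longitude"]

print(merged)
```

The renamed "Turkey" row survives the merge and the unmatched "World" row is dropped, which is the behavior the generated code relies on.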
