Your customer data is in a SQL database. You’re assigned a task that involves retrieving data from some tables, doing some data cleaning and manipulation, and writing the results to a different table.
Unfortunately, you don’t know how to do those operations with SQL. No worries! You’re great at using Pandas for data cleaning and manipulation. So you come up with the following solution:
- Retrieve all the data from SQL tables
- Download the data as CSV files
- Read the CSV files into Pandas DataFrames
- Perform the required data cleaning and manipulation operations
- Write the results to a different CSV file
- Upload the data in the CSV file to a SQL table
Nice plan, right?
If you actually execute this plan, I’m sure your manager will have a talk with you, which may be pleasant or unpleasant depending on your manager’s personality. Either way, I doubt you’ll execute this awesome plan again after the talk.
I know there are usually many different ways of doing a task in data science. You should always aim for the most efficient one because you’ll typically work with very large datasets. Making things more complicated than necessary costs you extra money and time.
“I’m great at Pandas so I’ll do everything with Pandas” is not a desirable attitude. If your task involves reading data from SQL tables and writing results to SQL tables, the best way is usually to do the steps in between with SQL as well, so the data never has to leave the database.
SQL is not just a query language. It can be used as a highly efficient data analysis and manipulation tool as well.
I remember writing SQL jobs to do very complex data preprocessing operations and they worked just fine.
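To make this concrete, here is a minimal sketch of cleaning data and writing it to a different table without the data ever leaving the database. It uses Python's built-in sqlite3 module with an in-memory database so it can run anywhere. The table names (customers_raw, customers_clean), the columns, and the cleaning rules are all assumptions made up for illustration, not a real schema, and the SQL in your own database may need small dialect changes.

```python
import sqlite3


def clean_customers(conn):
    """Copy cleaned rows from customers_raw into customers_clean using SQL only.

    Both table names and the cleaning rules are hypothetical examples.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS customers_clean (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            email       TEXT NOT NULL
        )
        """
    )
    # One INSERT ... SELECT does the read, the cleaning, and the write:
    # trim whitespace, lowercase emails, and skip rows without an email.
    conn.execute(
        """
        INSERT INTO customers_clean (customer_id, name, email)
        SELECT customer_id, TRIM(name), LOWER(TRIM(email))
        FROM customers_raw
        WHERE email IS NOT NULL AND TRIM(email) <> ''
        """
    )
    conn.commit()


# Build a tiny, made-up source table to run the sketch against.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers_raw (customer_id INTEGER, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers_raw VALUES (?, ?, ?)",
    [
        (1, "  Ada Lovelace ", "ADA@Example.com "),
        (2, "Bob", None),   # missing email, should be dropped
        (3, "Cy", "  "),    # blank email, should be dropped
    ],
)

clean_customers(conn)
rows = conn.execute(
    "SELECT customer_id, name, email FROM customers_clean ORDER BY customer_id"
).fetchall()
print(rows)
```

Compared with the CSV round trip, there is no export, no intermediate file, and no upload step: the database reads, transforms, and writes the rows in a single statement.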
Data science is still an evolving field. New tools and concepts are introduced all the time. You should not depend on a single tool, and you should always be open to learning new ones.