Can LLMs reverse-engineer a consolidated dataset to design the original database and suggest the corresponding data quality checks?
Following up on my previous posts on how to leverage Generative AI for Data activities, I’d like to explore a use case where a Data team receives a consolidated dataset from a business function (let’s say Human Resources) and needs to redesign a proper data model in their Data Platform to handle future queries.
We’ll compare the answers from GPT-4 and Bard to determine which model offers the more relevant ones.
(Note: the notebook and data source are available at the end of the article)
Sometimes, business solutions only let you extract information from their proprietary systems in the form of reports… and, if you are lucky, those reports might even be accessible through APIs.
This is the case at “MyCompany”, where the legacy HRIS can only provide a single extract of all employees, which also contains many company-level details, some of them confidential.
Following Data Mesh principles, the Human Resources team would like to expose this data, but they also understand that the report cannot be consumed as such, not to mention the confidentiality issues raised by columns like “Salary”, “Age”, or “Annual_Evaluation”.
When interacting with the Data Team, everyone around the table quickly understands that this dataset cannot be broadcast to all functions/employees and that it needs to be split into multiple tables.
Some of these tables could be leveraged by many for other analyses or use cases:
- the internal departments’ list
- the employees’ list with their email, department, country, and location
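
To make the target concrete before asking the LLMs, here is a minimal pandas sketch of the kind of split the team has in mind. The column names (Employee_ID, Email, etc.) are hypothetical placeholders, since the actual extract is only shared at the end of the article; the point is simply that the confidential columns never make it into the broadly shared tables.

```python
import pandas as pd

# Hypothetical sample of the consolidated HRIS extract (illustrative columns only)
extract = pd.DataFrame({
    "Employee_ID": [1001, 1002],
    "Email": ["a.martin@mycompany.com", "b.chen@mycompany.com"],
    "Department": ["Finance", "IT"],
    "Country": ["France", "Singapore"],
    "Location": ["Paris", "Singapore"],
    # Confidential columns that must not be broadcast
    "Salary": [55000, 62000],
    "Age": [34, 41],
    "Annual_Evaluation": ["B", "A"],
})

# Shareable table 1: the internal departments' list
departments = (
    extract[["Department"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Shareable table 2: the employees' list, excluding confidential columns
employees = extract[["Employee_ID", "Email", "Department", "Country", "Location"]]

print(departments)
print(employees)
```

The confidential columns (“Salary”, “Age”, “Annual_Evaluation”) would live in a separate, access-restricted table, which is exactly the kind of design decision we want the LLMs to propose on their own.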