Data are fundamental to building Machine Learning models, yet text data for training these models can be difficult to collect for the following reasons:
- Open-source text datasets are limited. Privacy rules and commercial confidentiality often restrict the distribution of privileged data. In addition, publicly available datasets may not be licensed for commercial use or, more critically, may not be context-relevant. For example, IMDB movie reviews are unlikely to be meaningful for analysing customer sentiment towards banking products.
- Machine Learning models typically need a large amount of training data to perform well. It may take a company, particularly a start-up, considerable time to collect a credible volume of text data. In addition, these data may not have been labelled with a response variable for a specific Machine Learning task. For example, a company may have been collecting verbatim customer complaints, but may not necessarily have a granular understanding of the topics or sentiments of those complaints.
How can we overcome the above constraints and generate fit-for-purpose text data in a scalable and cost-effective way? Given the recent advances in Large Language Models and Generative AI, this article* provides a tutorial on generating synthetic text data by calling OpenAI’s GPT model suites in Python.
To demonstrate, let’s explore a use case of generating customer complaints data for an insurance company. With enriched text data for training language models, the company could potentially achieve better customer outcomes through stronger performance on Natural Language Understanding tasks such as categorising complaints into topics or scoring complainant sentiment.
*This article is 100% ChatGPT-free.
Prerequisite: Setting up an OpenAI API key
To be able to call the GPT models, simply register an account with OpenAI and create an API key under User Settings. Make sure to keep this key private.