Enhancing CSV File Query Performance in ChatGPT | by Shuyang Xiang | Jul, 2023

with LangChain’s Self-Querying based on a customized CSV Loader

Shuyang Xiang

Towards Data Science

The advent of sophisticated language models, like ChatGPT, has brought a novel and promising approach to querying tabular data. However, due to token limitations, directly executing a query becomes challenging without the assistance of an API such as a retriever. Consequently, the accuracy of the answers relies heavily on the quality of the retrieval, and it is not uncommon for standard retrievers to fall short of returning the exact information required.

In this article, I will delve into the reasons behind the failure of conventional retriever methods in certain use cases, and propose a solution in the form of a customized CSV data loader that incorporates metadata information. By leveraging LangChain’s Self-Querying API alongside the new CSV data loader, we can extract information with significantly improved performance and precision.

For the detailed code used in this article, please take a look at the notebook here. I would like to highlight that this notebook illustrates that querying big tabular data with an LLM can achieve remarkable accuracy.

We would like to query the following synthetic SIR dataset created by the author: we simulate the three population groups during 90 days of a disease in 10 cities based on a simple SIR model. For simplicity, we suppose that the population of each city ranges from 5e3 to 2e4 and that there is no population movement between cities. Moreover, we generate ten random integers between 500 and 2000 as the initial numbers of infectious people.
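The data generation could be sketched along the following lines (a minimal sketch, not the author’s exact notebook code; the transmission rate `beta`, the removal rate `gamma`, and the random seed are assumptions):

```python
import numpy as np
import pandas as pd

def simulate_sir(s0, i0, beta=0.5, gamma=0.1, days=90):
    """Discrete-time SIR updates for a single closed-population city."""
    s, i, r = float(s0), float(i0), 0.0
    n = s + i
    rows = []
    for _ in range(days):
        rows.append((round(s), round(i), round(r)))
        new_inf = beta * s * i / n   # new infections this day
        new_rem = gamma * i          # removals (recoveries/deaths) this day
        s, i, r = s - new_inf, i + new_inf - new_rem, r + new_rem
    return rows

rng = np.random.default_rng(0)
frames = []
for k in range(10):
    pop = int(rng.integers(5_000, 20_000))  # city population in [5e3, 2e4]
    i0 = int(rng.integers(500, 2_000))      # initial infectious count
    df = pd.DataFrame(simulate_sir(pop - i0, i0),
                      columns=["susceptible", "infectious", "removed"])
    df.insert(0, "time", pd.date_range("2018-01-01", periods=90))
    df["city"] = f"city{k}"
    frames.append(df)

sir = pd.concat(frames, ignore_index=True)
sir.to_csv("sir.csv", index=False)
```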

Image by author: Disease population in 10 cities

The table has the following form with five columns: “time” indicating the time when the population was measured, “city” the city where the data was measured, and “susceptible”, “infectious”, and “removed” the three population groups. For simplicity, the data has been saved locally as a CSV file.

time susceptible infectious removed city
0 2018-01-01 8639 8639 0 city0
1 2018-01-02 3857 12338 1081 city0
2 2018-01-03 1458 13414 2405 city0
3 2018-01-04 545 12983 3749 city0
4 2018-01-05 214 12046 5017 city0

We would like to ask ChatGPT questions relevant to the dataset. To let ChatGPT interact with such tabular data, we follow these standard steps using LangChain:

  1. Use the CSVLoader to load the data,
  2. Create a vectorstore (we use Chroma here) to store the embedded data with OpenAI embeddings,
  3. Use a retriever to return documents relevant to a given unstructured query.

You can use the following code to realize the steps above.

from langchain.document_loaders import CSVLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load data from the local path
loader = CSVLoader(file_path=LOCAL_PATH)
data = loader.load()

# Create embeddings
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(data, embeddings)

# Create a retriever returning the top 20 chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

We can now define a ConversationalRetrievalChain to query the SIR dataset.


from langchain.prompts import (ChatPromptTemplate, SystemMessagePromptTemplate,
                               HumanMessagePromptTemplate)
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# Define the system message template
system_template = """The provided {context} is a tabular dataset containing the susceptible, infectious and removed population during 90 days in 10 cities.
The dataset includes the following columns:
'time': time the population was measured,
'city': city in which the population was measured,
'susceptible': the susceptible population of the disease,
'infectious': the infectious population of the disease,
'removed': the removed population of the disease."""

# Create the chat prompt templates
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
qa_prompt = ChatPromptTemplate.from_messages(messages)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=False,
    combine_docs_chain_kwargs={"prompt": qa_prompt},
    memory=memory,
    verbose=True,
)

In the code above, I have defined a conversation chain that searches for the information relevant to the query in the SIR dataset using the retriever defined in the previous step, and gives an answer based on the retrieved information. To give clearer instructions to ChatGPT, I have also given a prompt clarifying the definitions of all the columns in the dataset.

Let us now ask one simple question: “Which city has the most infectious people on 2018–02–03?”

Surprisingly, our chatbot said: “The dataset provided does not include data for the date 2018–02–03.”

How could it be possible?

To investigate why the chatbot failed to answer a question whose answer lies nowhere but in the provided dataset, I took a look at the relevant documents it retrieved for the question “Which city has the most infectious people on 2018–02–03?”. I got the following lines:

[Document(page_content=': 31\ntime: 2018-02-01\nsusceptible: 0\ninfectious: 1729\nremoved: 35608\ncity: city3', metadata={'source': 'sir.csv', 'row': 301}),
Document(page_content=': 1\ntime: 2018-01-02\nsusceptible: 3109\ninfectious: 9118\nremoved: 804\ncity: city8', metadata={'source': 'sir.csv', 'row': 721}),
Document(page_content=': 15\ntime: 2018-01-16\nsusceptible: 1\ninfectious: 2035\nremoved: 6507\ncity: city7', metadata={'source': 'sir.csv', 'row': 645}),
Document(page_content=': 1\ntime: 2018-01-02\nsusceptible: 3481\ninfectious: 10873\nremoved: 954\ncity: city5', metadata={'source': 'sir.csv', 'row': 451}),
Document(page_content=': 23\ntime: 2018-01-24\nsusceptible: 0\ninfectious: 2828\nremoved: 24231\ncity: city9', metadata={'source': 'sir.csv', 'row': 833}),
Document(page_content=': 1\ntime: 2018-01-02\nsusceptible: 8081\ninfectious: 25424\nremoved: 2231\ncity: city6', metadata={'source': 'sir.csv', 'row': 541}),
Document(page_content=': 3\ntime: 2018-01-04\nsusceptible: 511\ninfectious: 9733\nremoved: 2787\ncity: city8', metadata={'source': 'sir.csv', 'row': 723}),
Document(page_content=': 24\ntime: 2018-01-25\nsusceptible: 0\ninfectious: 3510\nremoved: 33826\ncity: city3', metadata={'source': 'sir.csv', 'row': 294}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 1413\nremoved: 35924\ncity: city3', metadata={'source': 'sir.csv', 'row': 303}),
Document(page_content=': 25\ntime: 2018-01-26\nsusceptible: 0\ninfectious: 3173\nremoved: 34164\ncity: city3', metadata={'source': 'sir.csv', 'row': 295}),
Document(page_content=': 1\ntime: 2018-01-02\nsusceptible: 3857\ninfectious: 12338\nremoved: 1081\ncity: city0', metadata={'source': 'sir.csv', 'row': 1}),
Document(page_content=': 23\ntime: 2018-01-24\nsusceptible: 0\ninfectious: 1365\nremoved: 11666\ncity: city8', metadata={'source': 'sir.csv', 'row': 743}),
Document(page_content=': 16\ntime: 2018-01-17\nsusceptible: 0\ninfectious: 2770\nremoved: 10260\ncity: city8', metadata={'source': 'sir.csv', 'row': 736}),
Document(page_content=': 3\ntime: 2018-01-04\nsusceptible: 487\ninfectious: 6280\nremoved: 1775\ncity: city7', metadata={'source': 'sir.csv', 'row': 633}),
Document(page_content=': 14\ntime: 2018-01-15\nsusceptible: 0\ninfectious: 3391\nremoved: 9639\ncity: city8', metadata={'source': 'sir.csv', 'row': 734}),
Document(page_content=': 20\ntime: 2018-01-21\nsusceptible: 0\ninfectious: 1849\nremoved: 11182\ncity: city8', metadata={'source': 'sir.csv', 'row': 740}),
Document(page_content=': 28\ntime: 2018-01-29\nsusceptible: 0\ninfectious: 1705\nremoved: 25353\ncity: city9', metadata={'source': 'sir.csv', 'row': 838}),
Document(page_content=': 23\ntime: 2018-01-24\nsusceptible: 0\ninfectious: 3884\nremoved: 33453\ncity: city3', metadata={'source': 'sir.csv', 'row': 293}),
Document(page_content=': 16\ntime: 2018-01-17\nsusceptible: 1\ninfectious: 1839\nremoved: 6703\ncity: city7', metadata={'source': 'sir.csv', 'row': 646}),
Document(page_content=': 15\ntime: 2018-01-16\nsusceptible: 1\ninfectious: 6350\nremoved: 20708\ncity: city9', metadata={'source': 'sir.csv', 'row': 825})]

Surprisingly, even though I specified that I wanted to know what happened on the date 2018–02–03, no row from that date was returned. Since no information on that date was ever sent to ChatGPT, there is of course no way it could answer such a question.

Diving into the source code of the retriever, we can see that get_relevant_documents calls similarity_search by default. The method returns the top k chunks (4 by default, but I set the number to 20 in my code) based on the computed distance metric (cosine distance by default), ranging from 0 to 1, which measures the “similarity” between the vector of the query and the vectors of the document chunks.

Back to the SIR dataset: we notice that each row tells almost the same story: on which date, in which city, how many people are in which group. It is no surprise that the vectors representing these rows are similar to one another. A quick check of the similarity scores shows that lots of rows end up with a score around 0.29.
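To see why the scores cluster together, consider the cosine distance between two vectors that, like embeddings of near-identical rows, point in almost the same direction (a toy illustration with made-up vectors, not actual OpenAI embeddings):

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity, the default metric in Chroma."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two rows of the SIR table differ only in a few numbers, so their
# (toy) embedding vectors are nearly parallel.
row_a = np.array([0.8, 0.1, 0.3, 0.5])
row_b = np.array([0.8, 0.12, 0.28, 0.5])
query = np.array([0.3, 0.9, 0.2, 0.1])

print(cosine_distance(row_a, row_b))   # tiny: the rows are near-duplicates
print(cosine_distance(query, row_a))
print(cosine_distance(query, row_b))   # nearly equal to the previous line
```

Because every row scores almost the same against the query, the ranking among them is essentially noise.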

Therefore, the similarity score is not strong enough to distinguish rows by how relevant they are to the query. This is often the case with tabular data, where rows do not differ significantly from one another. We need stronger filters that can work on the metadata.

It seems evident that the improvement of the chatbot’s performance depends largely on the efficiency of the retriever. To achieve this, we start by defining a customized CSVLoader that can communicate metadata information to the retriever.

We write the following code:

class MetaDataCSVLoader(BaseLoader):
    """Loads a CSV file into a list of documents.

    Each document represents one row of the CSV file. Every row is converted into a
    key/value pair and outputted to a new line in the document's page_content.

    The source for each document loaded from csv is set to the value of the
    `file_path` argument for all documents by default.
    You can override this by setting the `source_column` argument to the
    name of a column in the CSV file.
    The source of each document will then be set to the value of the column
    with the name specified in `source_column`.

    Output Example:
        .. code-block:: txt

            column1: value1
            column2: value2
            column3: value3
    """

    def __init__(
        self,
        file_path: str,
        source_column: Optional[str] = None,
        metadata_columns: Optional[List[str]] = None,
        content_columns: Optional[List[str]] = None,
        csv_args: Optional[Dict] = None,
        encoding: Optional[str] = None,
    ):
        # omitted (same as original code)
        self.metadata_columns = metadata_columns  # <= ADDED

    def load(self) -> List[Document]:
        """Load data into document objects."""
        docs = []
        with open(self.file_path, newline="", encoding=self.encoding) as csvfile:
            # omitted (same as original code)
            if self.metadata_columns:                 # <= ADDED
                for k, v in row.items():              # <= ADDED
                    if k in self.metadata_columns:    # <= ADDED
                        metadata[k] = v               # <= ADDED
            doc = Document(page_content=content, metadata=metadata)
        return docs

To save space, I have omitted the code that is the same as the original API and only included the few additional lines, which are mainly used to add certain columns requiring special attention to the metadata. Indeed, in the printed data above, you can notice two parts: page content and metadata. The standard CSVLoader writes all the columns of the table to the page content and only the data source and row number to the metadata. The defined “MetaDataCSVLoader” allows us to write other columns into the metadata.
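To make the page content / metadata split concrete, here is a toy standalone sketch of what the added lines accomplish for one row (hypothetical illustration code, not the loader itself; the row values are taken from the dataset):

```python
# One CSV row as the csv module's DictReader would yield it.
row = {"time": "2018-02-03", "susceptible": "0", "infectious": "1413",
       "removed": "35924", "city": "city3"}
metadata_columns = ["time", "city"]

# Standard behavior: every column goes into the page content...
page_content = "\n".join(f"{k}: {v}" for k, v in row.items())

# ...and the metadata only holds source and row number.
metadata = {"source": "sir.csv", "row": 303}

# The added lines copy the selected columns into the metadata as well.
metadata.update({k: v for k, v in row.items() if k in metadata_columns})

print(metadata)
# {'source': 'sir.csv', 'row': 303, 'time': '2018-02-03', 'city': 'city3'}
```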

We now re-create the vector store with the same steps as in the section above, except for the data loader, to which we add the two metadata_columns “time” and “city”.

# Load data and set embeddings 
loader = MetaDataCSVLoader(file_path="sir.csv",metadata_columns=['time','city']) #<= modified
data = loader.load()

Now we are ready to use the SelfQuerying API of LangChain:

According to LangChain’s documentation: A self-querying retriever is one that, as the name suggests, can query itself. … This allows the retriever to not only use the user-input query for semantic similarity comparison with the contents of stored documents, but to also extract filters from the user query on the metadata of stored documents and to execute those filters.
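Conceptually, the self-querying retriever turns the natural-language question into a semantic query plus a metadata filter, roughly like this (an illustrative sketch with a made-up filter representation, not LangChain’s actual internals, though the $eq operator resembles Chroma’s filter syntax):

```python
question = "Which city has the most infectious people on 2018-02-03?"

# What the LLM-built query constructor conceptually extracts:
structured = {
    "query": "city with most infectious people",  # used for similarity search
    "filter": {"time": {"$eq": "2018-02-03"}},    # applied on document metadata
}

def matches(metadata, flt):
    """Check a document's metadata against the extracted equality filter."""
    return all(metadata.get(attr) == cond["$eq"] for attr, cond in flt.items())

print(matches({"time": "2018-02-03", "city": "city3"}, structured["filter"]))  # True
print(matches({"time": "2018-01-02", "city": "city8"}, structured["filter"]))  # False
```

Only documents passing the filter are then ranked by similarity, which is exactly what the plain retriever was missing.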

Image by LangChain: Self-Querying illustration

You can understand now why I emphasized the metadata in the last section: it is the metadata on which ChatGPT or other LLMs build the filter to get the most relevant information from the dataset. We use the following code to build such a self-querying retriever by describing the metadata and the document content, based on which a well-performing filter can be built to extract accurate information. In particular, we give a metadata_field_info to the retriever, specifying the type and description of the two columns that we want the retriever to pay more attention to.

metadata_field_info = [
    AttributeInfo(name="time", type="string",
                  description="time the population was measured"),
    AttributeInfo(name="city", type="string",
                  description="city in which the population was measured"),
]
document_content_description = "Susceptible, infectious and removed population during 90 days in 10 cities"
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info,
    search_kwargs={"k": 20}, verbose=True
)

We can now define a similar ConversationalRetrievalChain to query the SIR dataset, but this time with the SelfQueryRetriever. Let us see what happens now when we ask the same question: “Which city has the most infectious people on 2018–02–03?”

The chatbot said: “The city with the maximum infectious people on 2018–02–03 is the city3 with 1413 infectious people.”

Ladies and gentlemen, it is correct! The chatbot is doing its job with a better retriever!

There is no harm in checking which relevant documents the retriever returns this time:

[Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 1413\nremoved: 35924\ncity: city3', metadata={'source': 'sir.csv', 'row': 303, 'time': '2018-02-03', 'city': 'city3'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 822\nremoved: 20895\ncity: city4', metadata={'source': 'sir.csv', 'row': 393, 'time': '2018-02-03', 'city': 'city4'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 581\nremoved: 14728\ncity: city5', metadata={'source': 'sir.csv', 'row': 483, 'time': '2018-02-03', 'city': 'city5'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 1355\nremoved: 34382\ncity: city6', metadata={'source': 'sir.csv', 'row': 573, 'time': '2018-02-03', 'city': 'city6'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 496\nremoved: 12535\ncity: city8', metadata={'source': 'sir.csv', 'row': 753, 'time': '2018-02-03', 'city': 'city8'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 1028\nremoved: 26030\ncity: city9', metadata={'source': 'sir.csv', 'row': 843, 'time': '2018-02-03', 'city': 'city9'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 330\nremoved: 8213\ncity: city7', metadata={'source': 'sir.csv', 'row': 663, 'time': '2018-02-03', 'city': 'city7'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 1320\nremoved: 33505\ncity: city2', metadata={'source': 'sir.csv', 'row': 213, 'time': '2018-02-03', 'city': 'city2'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 776\nremoved: 19753\ncity: city1', metadata={'source': 'sir.csv', 'row': 123, 'time': '2018-02-03', 'city': 'city1'}),
Document(page_content=': 33\ntime: 2018-02-03\nsusceptible: 0\ninfectious: 654\nremoved: 16623\ncity: city0', metadata={'source': 'sir.csv', 'row': 33, 'time': '2018-02-03', 'city': 'city0'})]

You might notice at once that there are “time” and “city” in “metadata” in the retrieved documents now.
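As a sanity check, the infectious counts retrieved above can be tabulated and the maximum confirmed with pandas (values copied from the documents above):

```python
import pandas as pd

# The ten rows retrieved for 2018-02-03, as (city, infectious) pairs.
rows = [
    ("city0", 654), ("city1", 776), ("city2", 1320), ("city3", 1413),
    ("city4", 822), ("city5", 581), ("city6", 1355), ("city7", 330),
    ("city8", 496), ("city9", 1028),
]
day = pd.DataFrame(rows, columns=["city", "infectious"])

# Pick the row with the largest infectious count.
top = day.loc[day["infectious"].idxmax()]
print(top["city"], top["infectious"])  # city3 1413
```

This matches the chatbot’s answer exactly.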

In this blog post, I have explored the limitations of ChatGPT when querying CSV format datasets, using the SIR dataset from 10 cities over a 90-day period as an example. To address these limitations, I have proposed a novel approach: a metadata-aware CSV data loader that enables us to leverage the self-querying API, significantly improving the accuracy and performance of the chatbot.
