A Question-Answering system can be a great help when analyzing large amounts of your data or documents. However, the sources (i.e., the parts of your documents) that the model used to create the answer are usually not shown in the final response.
Understanding the context and origin of responses is valuable not only for users seeking accurate information, but also for developers wanting to continuously improve their QA bots. With the sources included in the answer, developers gain valuable insights into the model’s decision-making process, facilitating iterative improvements and fine-tuning.
Using two examples, this article shows how to use LangChain and GPT-3 (text-davinci-003) to create a transparent Question-Answering bot that displays the sources it used to generate its answers.
In the first example, you’ll learn how to create a transparent QA bot that leverages your website’s content to answer questions. In the second example, you’ll see how to work with transcripts from different YouTube videos, both with and without timestamps.
Before we can leverage the capabilities of an LLM like GPT-3, we need to process our documents (e.g., website content or YouTube transcripts) into the correct format (first splitting them into chunks, then converting those chunks into embeddings) and store them in a vector store. Figure 1 below shows the process flow from left to right.
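To make this concrete, here is a minimal sketch of that pipeline in LangChain. It assumes the documents have already been loaded as LangChain `Document` objects; the variable name `documents`, the chunk size, and the choice of Chroma as the vector store are illustrative assumptions, not necessarily what this article uses later.

```python
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# `documents` is assumed to be a list of already-loaded
# LangChain Document objects (e.g., scraped articles).

# Split the documents into smaller chunks so each piece
# fits comfortably into the model's context window.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(documents)

# Convert each chunk into an embedding vector and store the
# vectors in a Chroma vector store for later similarity search.
vector_store = Chroma.from_documents(chunks, OpenAIEmbeddings())
```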
Website content example
In this example, we’ll process the content of the web portal It’s FOSS, which specializes in Open Source technologies, with a particular focus on Linux.
First, we need to obtain a list of all the articles we wish to process and store in our vector store. The code below reads the sitemap-posts.xml file, which contains a list of links to all the articles.
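One possible way such code could look is sketched below. It assumes the sitemap is available at https://itsfoss.com/sitemap-posts.xml (the exact URL is an assumption) and uses only the `requests` library and Python’s standard XML parser.

```python
import requests
import xml.etree.ElementTree as ET

# Assumed location of the sitemap; adjust if the site
# publishes sitemap-posts.xml somewhere else.
SITEMAP_URL = "https://itsfoss.com/sitemap-posts.xml"

response = requests.get(SITEMAP_URL)
response.raise_for_status()

# Sitemap files use the standard sitemap XML namespace.
root = ET.fromstring(response.content)
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Collect the <loc> entry of every article listed in the sitemap.
article_urls = [loc.text for loc in root.findall(".//sm:loc", namespace)]
print(f"Found {len(article_urls)} article links")
```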