For accompanying code, see the notebook here.
In recent years, large language models (LLMs) have become ubiquitous. Perhaps the most famous LLM is ChatGPT, released by OpenAI in November 2022. ChatGPT can generate ideas, give personalized recommendations, explain complicated topics, act as a writing assistant, or help you build a model to predict the Academy Awards. Meta has announced its own LLM, LLaMA; Google has LaMDA; and there is even an open-source alternative, BLOOM.
LLMs have excelled at natural language processing (NLP) tasks like the ones listed above because they have historically focused on unstructured data — data that has no pre-defined structure and is usually text-heavy. I asked ChatGPT, “why have LLMs historically focused on unstructured data?” The reply was:
“LLMs have historically focused on unstructured data due to its abundance, availability, and the challenges it poses. Unstructured data provides a vast source for training language models, allowing them to learn patterns, context, and semantics. LLMs excel in handling the complexities and nuances of unstructured data, capturing the subtleties of language and leveraging the rich context, opinions, and creative expressions it contains. While efforts are underway to bridge the gap between LLMs and structured data, the historical emphasis on unstructured data stems from its unique characteristics and the valuable information it offers, allowing LLMs to provide more human-like responses.”
Structured data, by contrast, is quantitative and well-organized, usually into rows and columns, and, as ChatGPT notes, there is still a gap between LLMs and structured data.
Knowledge graphs (KGs), on the other hand, are excellent at querying structured data. A knowledge graph is a
“directed labeled graph in which domain specific meanings are associated with nodes and edges. A node could represent any real-world entity, for example, people, company, computer, etc. An edge label captures the relationship…