In this blog post, let’s demystify how computers understand sentences and documents. To kick off this discussion, we will rewind time, beginning with the earliest methods of representing sentences: n-gram vectors and TF-IDF vectors. Later sections will discuss methods that aggregate word vectors, from neural bag-of-words to the sentence transformers and language models we see today. There is a lot of fun technology to cover. Let’s begin our journey with the simple, elegant n-grams.
Computers don’t understand words, but they do understand numbers. As such, we need to convert words and sentences into vectors before a computer can process them. One of the earliest representations of sentences as a vector can be traced back to a 1948 paper by Claude Shannon, the father of information theory. In this seminal work, sentences were represented as an n-gram vector of words. What does this mean?
Consider the sentence “This is a good day”. We can break this sentence down into the following n-grams:
- Unigrams: This, is, a, good, day
- Bigrams: This is, is a, a good, good day
- Trigrams: This is a, is a good, a good day
- and so on …
In general, a sentence can be broken down into its constituent n-grams, iterating from 1 to n. When constructing the vector, each element indicates whether the corresponding n-gram is present in the sentence or not. Some methods instead use the count of times the n-gram appears in the sentence. A sample vector representation of a sentence is shown above in Figure 1.
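The construction above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the toy vocabulary below is hypothetical, and a real system would build its vocabulary from an entire corpus.

```python
from itertools import chain

def ngrams(tokens, n):
    """Return the n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_vector(sentence, vocabulary, max_n=2, use_counts=False):
    """Represent a sentence as a vector over a fixed n-gram vocabulary.

    Each position holds 1/0 for presence, or the raw count when
    use_counts=True (the count-based variant mentioned above).
    """
    tokens = sentence.lower().split()
    present = list(chain.from_iterable(
        ngrams(tokens, n) for n in range(1, max_n + 1)))
    if use_counts:
        return [present.count(g) for g in vocabulary]
    return [1 if g in present else 0 for g in vocabulary]

# Hypothetical toy vocabulary of unigrams and bigrams.
vocab = [("this",), ("is",), ("good",),
         ("this", "is"), ("good", "day"), ("rainy",)]
print(ngram_vector("This is a good day", vocab))  # → [1, 1, 1, 1, 1, 0]
```

Note that the vector is as long as the vocabulary, so with large corpora these representations become very high-dimensional and sparse.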
Another early yet popular method of representing sentences and documents involved computing the TF-IDF vector, or the “Term Frequency — Inverse Document Frequency” vector, of a sentence or document. In this case, we would count the number of times a word appears in the sentence to…
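The term-frequency idea can be sketched as follows. This is a simplified version using the common tf × log(N / df) weighting; the exact formula varies, and real implementations (e.g. scikit-learn’s `TfidfVectorizer`) add smoothing and normalization on top.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF vectors for a small corpus of tokenized documents.

    tf(w, d)  = count of w in d / length of d
    idf(w)    = log(N / number of documents containing w)
    weight    = tf * idf
    """
    n_docs = len(documents)
    df = Counter()                      # document frequency of each word
    for doc in documents:
        df.update(set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[w] / len(doc)) * math.log(n_docs / df[w])
            for w in vocab
        ])
    return vocab, vectors

docs = [["this", "is", "a", "good", "day"],
        ["this", "is", "a", "rainy", "day"]]
vocab, vecs = tf_idf(docs)
# Words shared by every document ("this", "is", "a", "day") get weight 0,
# while distinguishing words ("good", "rainy") get positive weight.
```

The intuition: a word that appears often in one document but rarely across the corpus is a good signal of what that document is about, while words common to every document carry little information.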