AI’s Sentence Embeddings, Demystified | by Ajay Halthor | Aug, 2023

Bridging the gap between computers and language: How AI Sentence Embeddings Revolutionize NLP

Ajay Halthor

Towards Data Science

Photo by Steve Johnson on Unsplash

In this blog post, let’s demystify how computers understand sentences and documents. To kick this discussion off, we will rewind time beginning with the earliest methods of representing sentences using n-gram vectors and TF-IDF vectors. Later sections will discuss methods that aggregate word vectors from neural bag of words to the sentence transformers and language models we see today. There is a lot of fun technology to cover. Let’s begin our journey with the simple, elegant n-grams.

Computers don’t understand words, but they do understand numbers. As such, we need to convert words and sentences into vectors for a computer to process them. One of the earliest representations of sentences as vectors can be traced back to a 1948 paper by Claude Shannon, the father of information theory. In this seminal work, sentences were represented as n-gram vectors of words. What does this mean?

Figure 1: Generating n-gram vector from a sentence. (image by author)

Consider the sentence “This is a good day”. We can break this sentence down into the following n-grams:

  • Unigrams: This, is, a, good, day
  • Bigrams: This is, is a, a good, good day
  • Trigrams: This is a, is a good, a good day
  • and much more …

In general, a sentence can be broken down into its constituent n-grams by iterating from 1 to n. When constructing the vector, each entry indicates whether the corresponding n-gram is present in the sentence. Some methods instead use the count of times the n-gram appears in the sentence. A sample vector representation of a sentence is shown above in Figure 1.
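The n-gram breakdown above can be sketched in a few lines of Python. This is a minimal illustration assuming simple whitespace tokenization and lowercasing, plus a tiny made-up two-sentence vocabulary for the presence vector; it is not the exact pipeline from any particular library.

```python
def ngrams(sentence, n):
    """Return the list of n-grams (as tuples of tokens) in a sentence."""
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This is a good day"
unigrams = ngrams(sentence, 1)  # [('this',), ('is',), ('a',), ('good',), ('day',)]
bigrams = ngrams(sentence, 2)   # [('this', 'is'), ('is', 'a'), ('a', 'good'), ('good', 'day')]

# Build a binary presence vector over a small bigram vocabulary
# drawn from a toy two-sentence corpus (hypothetical example data).
corpus = ["This is a good day", "This is a bad day"]
vocab = sorted({g for s in corpus for g in ngrams(s, 2)})
vector = [1 if g in set(bigrams) else 0 for g in vocab]
```

Swapping the `1 if ... else 0` for `bigrams.count(g)` gives the count-based variant mentioned above.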

Another early yet popular method of representing sentences and documents involves computing the TF-IDF, or "Term Frequency-Inverse Document Frequency," vector of a sentence. In this case, we would count the number of times a word appears in the sentence to…
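The intuition behind TF-IDF is that a word matters more when it is frequent in one document but rare across the corpus. Here is a minimal sketch of that idea, assuming whitespace tokenization, a made-up toy corpus, and one common smoothed-IDF variant (libraries such as scikit-learn use slightly different formulas):

```python
import math

# A toy corpus of "documents" (here, short sentences) -- illustrative only.
docs = [
    "this is a good day",
    "this is a bad day",
    "good food today",
]

def tf_idf(term, doc, docs):
    """TF-IDF score of a term in one document, with a smoothed IDF."""
    tokens = doc.split()
    tf = tokens.count(term) / len(tokens)           # term frequency in this document
    df = sum(1 for d in docs if term in d.split())  # number of documents containing the term
    idf = math.log(len(docs) / (1 + df)) + 1        # smoothed inverse document frequency
    return tf * idf
```

With this toy corpus, the rarer word "food" scores higher in the third document than the more common word "good", which is exactly the weighting behavior TF-IDF is designed to produce.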

