In 2017, authors from Google published a paper called "Attention Is All You Need," in which they introduced the Transformer architecture. This new architecture achieved unparalleled success in language translation tasks, and the paper quickly became essential reading for anyone immersed in the area. Like many others, when I read the paper for the first time, I could see the value of its innovative ideas, but I didn't realize just how disruptive the paper would be to other areas under the broader umbrella of AI. Within a few years, researchers adapted the Transformer architecture to many tasks other than language translation, including image classification, image generation, and protein folding. In particular, the Transformer architecture revolutionized text generation and paved the way for GPT models and the exponential growth we're currently experiencing in AI.
Given how pervasive Transformer models are these days, both in industry and in academia, understanding the details of how they work is an important skill for every AI practitioner. This article will focus mostly on the architecture of GPT models, which are built using a subset of the original Transformer architecture, but it will also cover the original Transformer at the end. For the model code, I'll start from the most clearly written implementation I have found for the original Transformer: The Annotated Transformer from Harvard University. I'll keep the parts that are relevant to a GPT-style transformer and remove the parts that aren't. Along the way, I'll avoid making any unnecessary changes to the code, so that you can easily compare the GPT-like version with the original and understand the differences.
This article is intended for experienced data scientists and machine learning engineers. In particular, I assume that you’re well-versed in tensor algebra, that you’ve implemented neural networks from scratch, and that you’re comfortable with Python. In addition, even though I’ve done my best to make this article stand on its own, you’ll have an easier time understanding it if you’ve read my previous article on How GPT models work.
The code in this post can be found in the associated project on GitHub.