Breaking news from the world of artificial intelligence! OpenAI’s renowned deep learning expert, Andrej Karpathy, has undertaken an exciting weekend project that could revolutionize how we run complex models on resource-constrained devices. With his creation of “Baby Llama,” a simplified version of the Llama 2 model, Karpathy showcases the power of pure C code and its potential to enable highly interactive rates on small machines. Let’s dive into this game-changing development!
A Quest for Interactive Rates – The Birth of Baby Llama
Driven by his curiosity to explore new possibilities, Andrej Karpathy, a pioneer in the field of deep learning, set out on a mission to unleash the potential of the open-source Llama 2. Rather than taking the weekend off, Karpathy dedicated his time to experimenting with Llama 2, demonstrating his passion for pushing the boundaries of AI.
Converting GPT-2 to Llama 2: The Weekend Experiment
In his GitHub repository, llama2.c, Karpathy shared insights into his creative process. He took the nanoGPT framework and skillfully transformed it into the Llama 2 architecture, pairing the PyTorch training code with an inference engine written in pure C. As a result, his repository garnered significant attention, amassing over 2.2K stars within a short span.
Interactive Rates with Resource-Constrained Models
One of the most astonishing achievements of Karpathy’s experiment is the highly interactive rate he achieved with a reasonably sized model: a Llama 2 model of about 15 million parameters, trained on the TinyStories dataset. Despite the model’s modest size, Karpathy’s approach succeeded remarkably.
Astounding Speed on Low-Powered Devices
On his M1 MacBook Air, Karpathy managed to achieve impressive results. The Llama 2 model, with its roughly 15 million parameters, showcased a blazing inference speed of approximately 100 tokens per second in fp32 (single-precision floating-point). This outcome underscores how readily capable models can run on resource-constrained devices.
Pushing the Limits – Bigger and Better
Encouraged by the initial success, Karpathy continued to push the boundaries. He actively updated the repository and ventured into testing a more substantial 44-million-parameter model, roughly three times larger than the first. To his amazement, training it for 200K iterations with a batch size of 32 on four A100 GPUs took only about eight hours.
Inspiration from LLaMA.cpp and the PyTorch Connection
Karpathy acknowledges that his project was heavily inspired by Georgi Gerganov’s “llama.cpp,” a project that likewise ran LLaMA on a MacBook using C and C++. Karpathy’s approach began with training the Llama 2 architecture from scratch in PyTorch. He then used a roughly 500-line C file, “run.c,” to run inference with minimal memory usage and no external libraries.
Fine-Tuning for Enhanced Performance
To further optimize the C code, Karpathy explored various techniques, including compiler flags such as -O3, -Ofast, and -march=native. These flags enable vectorization, loop unrolling, and other hardware-specific tuning, leading to even faster inference on a given system.
Not Ready for Deployment – Yet a Glimpse into the Future
While Karpathy’s weekend experiment has been a groundbreaking success, he clarifies that Baby Llama is not intended for production-grade deployment. The primary objective was to show the feasibility of running Llama 2 models on low-powered devices, challenging the common belief that running machine learning models requires GPUs.
Shaping the Future of AI on Smaller Devices
The impact of Karpathy’s experiment reaches beyond the realm of weekend projects. It sets a precedent for integrating models on smaller, local devices without needing GPUs. This breakthrough could potentially pave the way for Microsoft, through its partnership with Meta, to release a series of tiny LLMs based on Llama 2, ushering in a new era of AI accessibility.
Andrej Karpathy has launched Baby Llama, a simplified version of the Llama 2 model, and its development showcases the immense potential of running AI models in pure C code on low-powered devices. With interactive rates and fast inference from such a small footprint, this groundbreaking experiment sets the stage for a future where complex AI applications can thrive even on resource-constrained machines. The world of AI is undeniably witnessing a paradigm shift, and Baby Llama might just be the beginning!