Distributed Llama 2 on CPUs

A toy example of bulk inference on commodity hardware using Python, via llama.cpp and PySpark.

Jonathan Apple

Towards Data Science

A computer-generated image of a llama with hundreds of CPUs orbiting it
Image by author via DALL-E

This exercise is about using Llama 2, an LLM (Large Language Model) from Meta AI, to summarize many documents at once. The scalable summarization of unstructured, semi-structured, and structured text can exist as a feature by itself, and also be part of data pipelines that feed into downstream machine learning models.

Specifically, we want to prove the simultaneous feasibility of:

  • Running Llama 2 on CPUs (i.e., removing GPU capacity constraints)
  • Smooth integration of an LLM with Apache Spark (a key part of Big Data ecosystems)
  • No usage of third-party endpoints (i.e., models must run locally due to air-gapped infrastructure or confidentiality requirements)

A lot of the hard work has already been done for us!

The llama.cpp project enables running simplified LLMs on CPUs by reducing the resolution (“quantization”) of their numeric weights. Ready-to-use quantized model files are readily available.

Next, the llama-cpp-python bindings provide simple access to using llama.cpp from within Python.

Finally, Spark’s applyInPandas() enables splitting giant data sources into Pandas-sized chunks and processing them independently. Note that this approach can be an anti-pattern if vectorized Spark functions can accomplish the same result, but in our case we’re basically using Spark as a simple orchestrator to scale out our llama.cpp usage. There are likely more efficient ways to use llama.cpp in batch processing, but this one is attractive given its simplicity and the automatic benefits of Spark’s fault tolerance and scalability.

A picture of “War and Peace”, a giant novel by Leo Tolstoy
Creative Commons License (CC BY-SA 3.0)

As a fun test, we’ll be using Llama 2 to summarize Leo Tolstoy’s War and Peace, a 1200+ page novel with over 360 chapters. We’ll treat each chapter as a document. Note that Llama 2 already “knows” about the novel; asking it about a key character generates this output (using llama-2-7b-chat.ggmlv3.q8_0.bin):

Briefly describe the character Anna Pavlovna from 'War and Peace'

Anna Pavlovna is a major character in Leo Tolstoy's novel "War and Peace".
She is a wealthy, intelligent, and charming socialite who is known for her beauty and wit.
Despite her beauty and charm, she is also portrayed as being manipulative and cunning, often using her intelligence and wit to get what she wants.
Throughout the novel, Anna Pavlovna is shown to be a complex and multifaceted character, with both admirable and flawed qualities.
She is particularly close to the main character Pierre Bezukhov, and plays an important role in his personal and emotional development.


  1. Install the 7B quantized chat model and llama-cpp-python.
  2. Download the novel, split by chapter, create a Spark DataFrame.
  3. Partition by chapter and generate summaries.

Configuring a Spark cluster is outside our scope; I’ll assume you have Spark running locally, through a managed service (like Synapse or Elastic MapReduce), or on a custom deployment like Kubernetes.

There are two artifacts that need to be installed on all worker nodes, whether those nodes are physical machines, VMs, or pods in a serverless pool:

  • Llama 2 model in GGML format (located in /models)
  • The llama-cpp-python module (installed via pip)

We’re using the 7B chat “Q8” version of Llama 2, found here. The download links might change, but a single-node, “bare metal” setup is similar to below:

Ensure you can use the model via python3 and this example. To recap, every Spark context must be able to read the model from /models and access the llama-cpp-python module.
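As a quick sanity check, a minimal sketch along these lines should print a completion on any node that is set up correctly. It assumes the model file name quoted earlier sits under /models and that the installed llama-cpp-python release still reads GGML files (newer releases expect GGUF). The helper names load_model and ask are illustrative and not part of the library.

```python
MODEL_PATH = "/models/llama-2-7b-chat.ggmlv3.q8_0.bin"  # assumed location and file name


def load_model(model_path=MODEL_PATH):
    """Load the quantized model (llama_cpp is imported lazily, only when needed)."""
    from llama_cpp import Llama
    return Llama(model_path=model_path)


def ask(llm, prompt, max_tokens=256):
    """Run one completion and return only the generated text."""
    output = llm(prompt, max_tokens=max_tokens)
    return output["choices"][0]["text"].strip()


# Usage on a node that has both the model and the module installed:
#   llm = load_model()
#   print(ask(llm, "Briefly describe the character Anna Pavlovna from 'War and Peace'"))
```

If this works in a plain python3 session on every worker, the Spark steps that follow have what they need.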

The Bash commands below download the novel and print word counts.

Next, we read the text file in Python, removing the Project Gutenberg header and footer. We’ll split on the regex CHAPTER .+ to create a list of chapter strings and create a Spark DataFrame from them (this code assumes a SparkSession named spark).
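The steps just described could be sketched roughly as follows. This is an illustration under assumptions rather than the original snippet: the "*** START OF" and "*** END OF" marker patterns are a guess at how the Project Gutenberg header and footer are delimited, the helper names are hypothetical, and the exact chapter count depends on how the front matter is handled.

```python
import re


def strip_gutenberg_boilerplate(raw):
    """Keep the text between the (assumed) Gutenberg start and end markers."""
    start = re.search(r"\*\*\* ?START OF .*?\*\*\*", raw)
    end = re.search(r"\*\*\* ?END OF .*?\*\*\*", raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish]


def split_chapters(body):
    """Split on chapter headings; drop the text before the first heading and empty pieces."""
    pieces = re.split(r"CHAPTER .+", body)
    return [piece for piece in pieces[1:] if piece.strip()]


def max_words(chapters):
    """Largest whitespace-delimited word count across chapters."""
    return max(len(chapter.split()) for chapter in chapters)


def to_spark_df(spark, chapters):
    """Build a (text, chapter) DataFrame from an existing SparkSession."""
    rows = [(text, number) for number, text in enumerate(chapters, start=1)]
    return spark.createDataFrame(rows, schema="text string, chapter int")
```

With the novel loaded into a string, something like `to_spark_df(spark, split_chapters(strip_gutenberg_boilerplate(raw)))` would yield the DataFrame used in the rest of the exercise.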

The code should produce the following output:

number of chapters = 365
max words per chapter = 3636

|                                                        text|chapter|
|\n\n“Well, Prince, so Genoa and Lucca are now just family...|      1|
|\n\nAnna Pávlovna’s drawing room was gradually filling. T...|      2|
|\n\nAnna Pávlovna’s reception was in full swing. The spin...|      3|
|\n\nJust then another visitor entered the drawing room: P...|      4|
|\n\n“And what do you think of this latest comedy, the cor...|      5|
|\n\nHaving thanked Anna Pávlovna for her charming soiree,...|      6|
|\n\nThe rustle of a woman’s dress was heard in the next r...|      7|
|\n\nThe friends were silent. Neither cared to begin talki...|      8|
|\n\nIt was past one o’clock when Pierre left his friend. ...|      9|
|\n\nPrince Vasíli kept the promise he had given to Prince...|     10|

Great! Now we have a DataFrame with 365 rows, each containing the full chapter text and number. The final step is creating a new DataFrame with summaries of each chapter.

Below is the Python code for generating a single chapter summary (see the call to limit(1) to return a single row). Explanation below the snippet:

The llama2_summarize() function is the code that Spark applies per group. Since we’re grouping by the chapter column, the function is called once per chapter; the df argument is simply a Pandas DataFrame with a single row. Note that we’re loading the model on every call of llama2_summarize(); this shortcut keeps the code simple, but it is not very efficient.

Finally, using Spark we do the groupby() and call applyInPandas(), setting the schema to include the chapter summary and number.
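A sketch of what this description amounts to is shown below. It is a reconstruction under assumptions, not the original snippet: the prompt wording, the 2000-word clipping, the n_ctx and max_tokens values, and the helper names build_prompt, summarize_with, and summarize_chapters are all hypothetical. Only llama2_summarize(), groupby(), limit(1), and applyInPandas() come from the text.

```python
import pandas as pd

MODEL_PATH = "/models/llama-2-7b-chat.ggmlv3.q8_0.bin"  # assumed location and file name
# Hypothetical prompt wording; adjust to taste.
PROMPT_TEMPLATE = "Summarize the following novel chapter in one paragraph:\n\n{text}\n\nSummary:"


def build_prompt(chapter_text, max_words=2000):
    """Clip long chapters (an assumed safeguard for the context window) and fill the template."""
    clipped = " ".join(chapter_text.split()[:max_words])
    return PROMPT_TEMPLATE.format(text=clipped)


def summarize_with(llm, df):
    """Summarize the single chapter row in df using an already loaded model."""
    text = df["text"].iloc[0]
    chapter = int(df["chapter"].iloc[0])
    output = llm(build_prompt(text), max_tokens=256)
    summary = output["choices"][0]["text"].strip()
    return pd.DataFrame({"summary": [summary], "chapter": [chapter]})


def llama2_summarize(df):
    """Per-group function for applyInPandas; loads the model on every call."""
    from llama_cpp import Llama  # imported on the worker
    llm = Llama(model_path=MODEL_PATH, n_ctx=4096)  # n_ctx value is an assumption
    return summarize_with(llm, df)


def summarize_chapters(chapters_df, limit=1):
    """Group by chapter and summarize; limit=1 mirrors the single-chapter test."""
    return (
        chapters_df.limit(limit)
        .groupby("chapter")
        .applyInPandas(llama2_summarize, schema="summary string, chapter int")
    )
```

Separating summarize_with() from the model loading keeps the per-row logic testable without a model file, and makes it easier to swap in a cached model later.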

The output (reformatted for readability) looks like this:

The chapter is about a conversation between Prince Vasíli Kurágin and
Anna Pávlovna Schérer, a well-known socialite and favorite
of Empress Márya Fëdorovna.
They are discussing various political matters, including the possibility
of war with France and Austria's role in the conflict.
Prince Vasíli is hoping to secure a post for his son through
the Dowager Empress, while Anna Pávlovna is enthusiastic
about Russia's potential to save Europe from Napoleon's tyranny.
The conversation also touches on personal matters,
such as Prince Vasíli's dissatisfaction with his younger son
and Anna Pávlovna's suggestion that he marry off
his profligate son Anatole to a wealthy heiress.


(Note the mention of Napoleon even though the name doesn’t occur in the chapter! Again, this is a fun exercise rather than a realistic example using truly unseen documents.)

The runtime for this single-chapter test is about 2 minutes on a 64-core VM. There are many choices we glossed over that affect runtime, such as model size/quantization and model parameters. The key result is that by scaling out our Spark cluster appropriately, we can summarize all chapters in a handful of minutes. Processing hundreds of thousands (or even millions!) of documents daily is thus possible using large Spark clusters composed of cheap virtual machines.
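The scaling claim is back-of-the-envelope arithmetic, which the sketch below makes explicit. It assumes every chapter takes about the same 2 minutes measured above, that each worker is comparable to the 64-core test VM and handles one chapter at a time, and that scheduling and model-loading overhead can be ignored; the function name is hypothetical.

```python
import math


def estimated_minutes(num_docs, concurrent_workers, minutes_per_doc=2.0):
    """Rough wall-clock estimate: documents are processed in waves of concurrent_workers."""
    waves = math.ceil(num_docs / concurrent_workers)
    return waves * minutes_per_doc


print(estimated_minutes(365, 1))    # one worker: 730.0 minutes, roughly 12 hours
print(estimated_minutes(365, 100))  # 100 workers: 4 waves, 8.0 minutes
print(estimated_minutes(365, 365))  # one worker per chapter: 2.0 minutes
```

Real runs will deviate from these figures because chapter lengths vary widely, but the estimate shows why the total time is governed by cluster size rather than by the per-chapter cost.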

We haven’t even mentioned adjusting the standard LLM parameters like temperature and top_p which control the “creativity” and randomness of results, or prompt engineering, which is practically a discipline of its own. We also chose the Llama 2 7B model without justification; there might be smaller and more performant models or model families more suited to our particular use case.
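For reference, sampling parameters can be passed per call in llama-cpp-python. The sketch below is only an illustration: the helper name and the specific values for temperature and top_p are assumptions, not tuned recommendations.

```python
def generate(llm, prompt, temperature=0.2, top_p=0.9, max_tokens=256):
    """Run one completion with explicit sampling settings and return the text."""
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,  # lower values give less random output
        top_p=top_p,              # nucleus sampling cutoff
    )
    return output["choices"][0]["text"].strip()
```

Lower temperature values tend to suit summarization, where consistency matters more than variety.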

Instead, we’ve shown how to easily distribute (quantized) LLM workloads using Spark with fairly minimal effort. Next steps might include:

  • More efficient load/caching of models
  • Parameter optimization for different use cases
  • Custom prompts
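As one possible starting point for the first item, the model could be kept in a module-level cache so that each Python worker process loads it only once. This is a sketch that assumes Spark reuses Python worker processes across tasks; get_model and its loader argument are hypothetical names introduced for illustration.

```python
_MODEL_CACHE = {}


def get_model(model_path, loader=None):
    """Return a cached model for model_path, loading it on first use only."""
    if model_path not in _MODEL_CACHE:
        if loader is None:
            # Default loader: the real llama-cpp-python model, imported on the worker.
            from llama_cpp import Llama
            loader = lambda path: Llama(model_path=path)
        _MODEL_CACHE[model_path] = loader(model_path)
    return _MODEL_CACHE[model_path]
```

Inside the per-group function, calling get_model() instead of constructing the model directly would then pay the load cost once per worker process rather than once per chapter.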
