This exercise is about using Llama 2, an LLM (Large Language Model) from Meta AI, to summarize many documents at once. The scalable summarization of unstructured, semi-structured, and structured text can exist as a feature by itself, and also be part of data pipelines that feed into downstream machine learning models.
Specifically, we want to prove the simultaneous feasibility of:
- Running Llama 2 on CPUs (i.e., removing GPU capacity constraints)
- Smooth integration of an LLM with Apache Spark (a key part of Big Data ecosystems)
- No usage of third-party endpoints (i.e., models must run locally due to air-gapped infrastructure or confidentiality requirements)
A lot of the hard work has already been done for us! First, llama.cpp makes it possible to run quantized Llama 2 models efficiently on CPUs. Next, the llama-cpp-python bindings provide simple access to llama.cpp from within Python.
`applyInPandas()` (docs) enables splitting giant data sources into Pandas-sized chunks and processing them independently. Note that this approach can be an anti-pattern if vectorized Spark functions can accomplish the same result, but in our case we’re essentially using Spark as a simple orchestrator to scale out our llama.cpp usage. There are likely more efficient ways to use llama.cpp in batch processing, but this one is attractive given its simplicity and the automatic benefits of Spark’s fault tolerance and scalability.
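To illustrate the per-group contract that `applyInPandas()` expects, here is a pandas-only sketch (the `word_count` function and toy data are hypothetical, not part of the pipeline below): Spark hands each group to your function as a Pandas `DataFrame`, and the function returns a Pandas `DataFrame` matching the declared schema.

```python
import pandas as pd

# Emulate applyInPandas locally: the function receives one group's rows
# as a pandas DataFrame and must return a pandas DataFrame.
def word_count(pdf: pd.DataFrame) -> pd.DataFrame:
    n = pdf["text"].str.split().str.len().sum()
    return pd.DataFrame({"chapter": [pdf["chapter"].iloc[0]], "words": [n]})

df = pd.DataFrame({"chapter": [1, 1, 2], "text": ["a b", "c", "d e f"]})
result = pd.concat(
    [word_count(group) for _, group in df.groupby("chapter")], ignore_index=True
)
print(result)  # chapter 1 has 3 words, chapter 2 has 3 words
```

In Spark, the same function would be passed to `df.groupby("chapter").applyInPandas(...)` along with an output schema.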
As a fun test, we’ll be using Llama 2 to summarize Leo Tolstoy’s War and Peace, a 1200+ page novel with over 360 chapters. We’ll treat each chapter as a document. Note that Llama 2 already “knows” about the novel; asking it about a key character generates this output:
```
Briefly describe the character Anna Pavlovna from 'War and Peace'

Anna Pavlovna is a major character in Leo Tolstoy's novel "War and Peace".
She is a wealthy, intelligent, and charming socialite who is known for her beauty and wit.
Despite her beauty and charm, she is also portrayed as being manipulative and cunning, often using her intelligence and wit to get what she wants.
Throughout the novel, Anna Pavlovna is shown to be a complex and multifaceted character, with both admirable and flawed qualities.
She is particularly close to the main character Pierre Bezukhov, and plays an important role in his personal and emotional development.
```
- Install the 7B quantized chat model and llama-cpp-python.
- Download the novel, split it by chapter, and create a Spark DataFrame.
- Partition by chapter and generate summaries.
There are two artifacts that need to be installed on all worker nodes, whether those nodes are physical machines, VMs, or pods in a serverless pool:
- The Llama 2 model in GGML format (located in `/models`)
- The llama-cpp-python module (installed via `pip`)
We’re using the 7B chat “Q8” version of Llama 2, found here. The download links might change, but a single-node, “bare metal” setup is similar to below:
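A sketch of that setup follows; the Hugging Face repository, file name, and paths are assumptions that may have changed, so verify them against the current model page:

```bash
# Hypothetical download location and file name -- check the model page for
# the current link before running.
sudo mkdir -p /models
sudo wget -O /models/llama-2-7b-chat.ggmlv3.q8_0.bin \
  "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin"

# Install the Python bindings for llama.cpp.
pip3 install llama-cpp-python
```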
Ensure you can use the model via `python3` and this example. To recap, every Spark context must be able to read the model from `/models` and access the llama-cpp-python module.
The Bash commands below download the novel and print word counts.
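Something like the following works (the exact Project Gutenberg URL is an assumption; ebook #2600 is War and Peace, but the file path may differ):

```bash
# Download the plain-text edition of War and Peace (Gutenberg ebook #2600).
wget -O war_and_peace.txt "https://www.gutenberg.org/cache/epub/2600/pg2600.txt"

# Total word count, plus a rough count of chapter headings.
wc -w war_and_peace.txt
grep -c "^CHAPTER" war_and_peace.txt
```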
Next, we read the text file in Python, removing the Project Gutenberg header and footer. We’ll split on the regex `CHAPTER .+` to create a list of chapter strings, then create a Spark `DataFrame` from them (this code assumes an active `SparkSession`).
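A sketch of that step is below; the Gutenberg marker strings and file name are assumptions, and the Spark call is shown commented for context since it needs a live cluster:

```python
import re

def split_chapters(raw: str) -> list:
    """Strip the Project Gutenberg header/footer and split on chapter headings."""
    # Marker strings are approximate; Gutenberg texts vary slightly.
    start = raw.find("*** START OF")
    end = raw.find("*** END OF")
    body = raw[start:end] if start != -1 and end != -1 else raw
    # Everything before the first "CHAPTER ..." heading is front matter, so
    # drop the first element of the split.
    return re.split(r"CHAPTER .+", body)[1:]

# On the driver (assumes an active SparkSession named `spark`):
# chapters = split_chapters(open("war_and_peace.txt", encoding="utf-8").read())
# df = spark.createDataFrame(
#     [(text, i + 1) for i, text in enumerate(chapters)], ["text", "chapter"]
# )
```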
The code should produce the following output:
```
number of chapters = 365
max words per chapter = 3636

|\n\n“Well, Prince, so Genoa and Lucca are now just family...| 1|
|\n\nAnna Pávlovna’s drawing room was gradually filling. T...| 2|
|\n\nAnna Pávlovna’s reception was in full swing. The spin...| 3|
|\n\nJust then another visitor entered the drawing room: P...| 4|
|\n\n“And what do you think of this latest comedy, the cor...| 5|
|\n\nHaving thanked Anna Pávlovna for her charming soiree,...| 6|
|\n\nThe rustle of a woman’s dress was heard in the next r...| 7|
|\n\nThe friends were silent. Neither cared to begin talki...| 8|
|\n\nIt was past one o’clock when Pierre left his friend. ...| 9|
|\n\nPrince Vasíli kept the promise he had given to Prince...| 10|
```
Great! Now we have a `DataFrame` with 365 rows, each containing the full chapter text and number. The final step is creating a new `DataFrame` with summaries of each chapter.
Below is the Python code for generating a single chapter summary (see the call to `limit(1)`, which returns a single row). An explanation follows the snippet.
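Since the exact snippet isn’t reproduced here, the sketch below shows the shape of the code; the model path, prompt wording, and generation parameters are assumptions rather than the article’s exact values:

```python
import pandas as pd

def llama2_summarize(df: pd.DataFrame) -> pd.DataFrame:
    # Imported inside the function so only Spark workers need llama-cpp-python.
    from llama_cpp import Llama

    # Reloading the model on every call is the simplicity shortcut discussed
    # in the text; caching one instance per executor would be faster.
    llm = Llama(model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin", n_ctx=4096)

    chapter_text = df.iloc[0]["text"]
    prompt = f"Summarize the following novel chapter:\n\n{chapter_text}\n\nSummary:"
    result = llm(prompt, max_tokens=256)
    summary = result["choices"][0]["text"]
    return pd.DataFrame({"summary": [summary], "chapter": [df.iloc[0]["chapter"]]})

# Driver side (assumes the chapter DataFrame `df` from the previous step):
# summaries = (
#     df.limit(1)  # single-chapter test; drop this to process all chapters
#     .groupby("chapter")
#     .applyInPandas(llama2_summarize, schema="summary string, chapter int")
# )
```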
The `llama2_summarize()` function is the code that Spark applies per group. Since we’re grouping by the `chapter` column, the function is called on each chapter row; the `df` argument is simply a Pandas `DataFrame` with a single row. Note that we’re reloading the model on every call of `llama2_summarize()`; this is a shortcut we take for simplicity, but it is not very efficient.
Finally, using Spark we do the `groupby()` and call `applyInPandas()`, setting the schema to include the chapter summary and number.
The output (reformatted for readability) looks like this:
```
The chapter is about a conversation between Prince Vasíli Kurágin and
Anna Pávlovna Schérer, a well-known socialite and favorite
of Empress Márya Fëdorovna.
They are discussing various political matters, including the possibility
of war with France and Austria's role in the conflict.
Prince Vasíli is hoping to secure a post for his son through
the Dowager Empress, while Anna Pávlovna is enthusiastic
about Russia's potential to save Europe from Napoleon's tyranny.
The conversation also touches on personal matters,
such as Prince Vasíli's dissatisfaction with his younger son
and Anna Pávlovna's suggestion that he marry off
his profligate son Anatole to a wealthy heiress.
```
(Note the use of Napoleon despite the fact it doesn’t occur in the chapter! Again, this is a fun exercise rather than a realistic example using truly unseen documents.)
The runtime for this single-chapter test is about 2 minutes on a 64-core VM. There are many choices we glossed over that affect runtime, such as model size/quantization and model parameters. The key result is that by scaling out our Spark cluster appropriately, we can summarize all chapters in a handful of minutes. Processing hundreds of thousands (or even millions!) of documents daily is thus possible using large Spark clusters built from cheap virtual machines.
We haven’t even mentioned adjusting standard LLM parameters like `top_p` that control the “creativity” and randomness of results, or prompt engineering, which is practically a discipline of its own. We also chose the Llama 2 7B model without justification; there might be smaller or more performant models, or other model families, better suited to our particular use case.
Instead, we’ve shown how to easily distribute (quantized) LLM workloads using Spark with fairly minimal effort. Next steps might include:
- More efficient load/caching of models
- Parameter optimization for different use cases
- Custom prompts