Large Language Models (LLMs) continue to soar in popularity, with a new one released nearly every week. As the number of these models grows, so do the options for hosting them. In my previous article we explored how to utilize DJL Serving within Amazon SageMaker to efficiently host LLMs. In this article we explore another optimized model server and solution: HuggingFace Text Generation Inference (TGI).
NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. This article also assumes an intermediate understanding of SageMaker deployment; I would suggest following this article to understand Deployment/Inference more in depth.
DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.
Why HuggingFace Text Generation Inference? How Does It Work With Amazon SageMaker?
TGI is a Rust-, Python-, and gRPC-based model server created by HuggingFace that can be used to host specific large language models. HuggingFace has long been the central hub for NLP, and TGI ships with a large set of optimizations specifically for LLMs; a few are listed below, and the documentation has an extensive list.
- Tensor Parallelism for efficient hosting across multiple GPUs
- Token Streaming with Server-Sent Events (SSE)
- Quantization with bitsandbytes
- Logits warper (different parameters such as temperature, top-k, top-p, etc.; see the sample payload after this list)
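To make the logits warper options above concrete, here is a minimal sketch of the kind of request body TGI's generate endpoint accepts. The parameter names follow TGI's documented generation schema; the prompt and values are purely illustrative.

```python
# Sketch: a TGI-style generation payload exercising the logits warper
# parameters listed above. Values are illustrative, not recommendations.
import json

payload = {
    "inputs": "What is the capital of France?",
    "parameters": {
        "temperature": 0.7,    # softens/sharpens the token distribution
        "top_k": 50,           # sample only from the 50 most likely tokens
        "top_p": 0.9,          # nucleus sampling probability threshold
        "max_new_tokens": 64,  # cap on the number of generated tokens
        "do_sample": True,     # enable sampling instead of greedy decoding
    },
}

print(json.dumps(payload, indent=2))
```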
A large positive of this solution that I noted is its simplicity of use. At the time of writing, TGI supports the following optimized model architectures, which you can directly deploy utilizing the TGI containers.
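Since the rest of the walkthrough relies on these containers within SageMaker, here is a minimal sketch of how you might retrieve the SageMaker-managed TGI image URI with the SageMaker Python SDK. The version string is an assumption; check the SDK for the versions it currently supports in your region.

```python
# Sketch: fetching the SageMaker-managed HuggingFace LLM (TGI) container URI.
# Assumes a recent sagemaker Python SDK is installed and AWS credentials/region
# are configured.
import sagemaker
from sagemaker.huggingface import get_huggingface_llm_image_uri

sess = sagemaker.Session()

# "huggingface" selects the TGI-backed LLM container
tgi_image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.8.2",  # assumption: pick a version the SDK currently supports
)

print(f"TGI container image: {tgi_image_uri}")
```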