Planning to integrate an LLM service into your code? Here are some of the common challenges you should expect when doing so.
Large Language Models (LLMs) existed before OpenAI’s ChatGPT and GPT API were released. But thanks to OpenAI’s efforts, GPT is now easily accessible to developers and non-developers alike. This launch has undoubtedly played a significant role in the recent resurgence of AI.
It is truly remarkable how quickly OpenAI’s GPT API was embraced within just six months of its launch. Virtually every SaaS service has incorporated it in some way to increase its users’ productivity.
However, only those who have completed the design and integration work with such APIs genuinely understand the complexities and new challenges that arise from it.
Over the last few months, I have implemented several features that utilize OpenAI’s GPT API. Throughout this process, I have faced several challenges that seem common for anyone utilizing the GPT API or any other LLM API. By listing them out here, I hope to help engineering teams properly prepare and design their LLM-based features.
Let’s take a look at some of the typical obstacles.
Contextual Memory and Context Limitations
This is probably the most common challenge of all. The context for the LLM input is limited. Just recently, OpenAI released support for 16K-token contexts, and in GPT-4 the context limit can reach 32K, which covers a good few pages of text. But there are many cases where you need more than that, especially when working with numerous documents, each tens of pages long (imagine a legal-tech company that needs to process dozens of legal documents to extract answers using an LLM).
There are different techniques to overcome this challenge, and more are emerging, but they all mean you must implement one or more of these techniques yourself. Yet another load of work to implement, test, and maintain.
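One such technique is chunking: split a long document into pieces that each fit the context window, process them separately, and combine the results. Here is a minimal sketch; it approximates token counts by whitespace splitting for simplicity, whereas a real implementation would use the model’s tokenizer.

```python
def chunk_document(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    """Split a long document into overlapping chunks that each fit the context limit.

    Tokens are approximated as whitespace-separated words; the overlap keeps
    some shared context between consecutive chunks.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # advance by less than a full chunk to overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last chunk already covers the end of the document
    return chunks
```

Each chunk can then be sent to the LLM in its own call, and the partial answers merged in a final call (a map-reduce style pattern).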
Data Enrichment
Your LLM-based features likely take some sort of proprietary data as input. Whether you are feeding user data into the context or using other collected data or documents that you store, you need a simple mechanism that abstracts the fetching of data from the various data sources you own.
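One way to build such a mechanism is a thin abstraction layer over your data sources. The sketch below is hypothetical: `CRMSource` and the in-memory records it wraps stand in for real clients and databases.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Common interface for anything that can supply data for a prompt."""
    @abstractmethod
    def fetch(self, key: str) -> dict: ...

class CRMSource(DataSource):
    """Hypothetical CRM-backed source; a dict stands in for the real client."""
    def __init__(self, records: dict):
        self._records = records
    def fetch(self, key: str) -> dict:
        return self._records.get(key, {})

class ContextBuilder:
    """Gathers prompt context from whichever sources a feature needs."""
    def __init__(self, sources: dict[str, DataSource]):
        self._sources = sources
    def build(self, keys: dict[str, str]) -> dict:
        return {name: self._sources[name].fetch(key) for name, key in keys.items()}
```

With this in place, a feature only declares which sources and keys it needs; swapping a mock source for a real one doesn’t touch the feature code.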
Prompt Templating
The prompt you submit to the LLM will combine hard-coded text with data from other data sources. This means you will create a static template and dynamically fill in the blanks at run time with the data that should be part of the prompt. In other words, you will create templates for your prompts, and likely more than one.
This means you should use some kind of templating framework, because you probably don’t want your code to look like a bunch of string concatenations.
This is not a big challenge, but it is another task to consider.
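Even the standard library is enough to get started. A minimal sketch using Python’s `string.Template`, with a hypothetical legal-query template as the example:

```python
from string import Template

# Hypothetical template; the placeholders are filled at run time.
LEGAL_QUERY_TEMPLATE = Template(
    "You are assisting a lawyer.\n"
    "Client details: $client\n"
    "Case summary: $case\n"
    "Question: $question\n"
)

def render_prompt(client: str, case: str, question: str) -> str:
    """Fill in the template's blanks; raises KeyError if a field is missing."""
    return LEGAL_QUERY_TEMPLATE.substitute(client=client, case=case, question=question)
```

A dedicated templating engine (e.g. Jinja2) adds loops and conditionals, but the principle is the same: the template lives in one place, and the code only supplies data.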
Testing and Fine-tuning
Getting the LLM to reach a satisfactory level of accuracy requires a lot of testing (sometimes it’s just prompt engineering with a lot of trial and error) and fine-tuning based on user feedback.
There are, of course, also tests that run as part of the CI to assert that the integration works properly, but that’s not the real challenge.
When I say testing, I’m talking about running the prompt repeatedly in a sandbox to fine-tune the results for accuracy.
For testing, you want a method by which the testing engineer can change the templates, enrich them with the required data, and execute the prompt against the LLM to verify that you’re getting what you wanted. How do you set up such a testing framework?
In addition, we need to constantly fine-tune the LLM model by getting feedback from our users regarding the LLM outputs. How do we set up such a process?
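A sandbox harness for this can start out quite small: run every template variant through the model and grade the outputs. In the sketch below, `call_llm` is a stand-in for the real API call, and the grader is a deliberately naive keyword check; both are assumptions, not real APIs.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: in a real sandbox this would call the LLM provider's API."""
    return f"ANSWER for: {prompt}"

def evaluate_templates(templates: dict, context: dict, grader) -> dict:
    """Render each template variant, run it through the model, and score the output.

    Returns {variant_name: score}, so the engineer can compare prompt versions.
    """
    results = {}
    for name, template in templates.items():
        prompt = template.format(**context)
        output = call_llm(prompt)
        results[name] = grader(output)
    return results
```

In practice the grader might be a human rating, a regex over expected fields, or even a second LLM call; the harness’s job is just to make comparing variants cheap and repeatable.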
Caching
LLM models, such as OpenAI’s GPT, have a parameter to control the randomness of answers, allowing the AI to be more creative. Yet if you are handling requests at a large scale, you will incur high charges on the API calls, you may hit rate limits, and your app’s performance might degrade.
If some inputs to the LLM repeat themselves across different calls, you may consider caching the answer. For example, say your LLM-based feature handles hundreds of thousands of calls. If every one of those calls triggers an API call to the LLM provider, the costs will be very high. Still, if inputs repeat themselves (which can easily happen when you use templates and fill them with specific user fields), there is a good chance you can save some of the pre-processed LLM output and serve it from the cache.
The challenge here is building a caching mechanism for that. It is not hard to implement, but it adds another layer and moving part that needs to be maintained and done properly.
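A minimal sketch of such a cache, keyed by a hash of the fully rendered prompt. Note that this only makes sense when responses are meant to be deterministic (e.g. the randomness/temperature parameter set to zero); the `llm_call` argument is a stand-in for the real API call.

```python
import hashlib

class LLMCache:
    """Cache LLM responses keyed by a hash of the rendered prompt.

    Repeated identical prompts are served from memory and never reach
    the paid API. Only valid when responses should be deterministic.
    """
    def __init__(self, llm_call):
        self._llm_call = llm_call
        self._cache: dict[str, str] = {}
        self.misses = 0  # how many prompts actually hit the provider

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._llm_call(prompt)
        return self._cache[key]
```

In production the in-memory dict would typically be replaced by a shared store such as Redis, with an expiry policy so stale answers eventually refresh.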
Security and Compliance
Security and privacy are perhaps the most challenging aspects of this process — how do we ensure that the process created does not cause data leakage and how do we ensure that no PII is revealed?
This is a common challenge for any software company that relies on 3rd party services, and it needs to be addressed here as well.
Monitoring and Logging
As with any external API you’re using, you must monitor its performance. Are there any errors? How long does the processing take? Are you exceeding, or about to exceed, the API’s rate limits or thresholds?
In addition, you will want to log all calls, not just for security audit purposes but also to help you fine-tune your LLM workflow or prompts by grading the outputs.
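A lightweight way to get both monitoring and an audit log is to wrap every LLM call in a decorator that records latency, errors, and the prompt/output pair. A sketch, where the wrapped function stands in for the real API call:

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("llm")

def monitored(llm_call):
    """Wrap an LLM call with latency measurement and audit logging."""
    @wraps(llm_call)
    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        try:
            output = llm_call(prompt)
        except Exception:
            logger.exception("LLM call failed after %.3fs", time.perf_counter() - start)
            raise
        logger.info("LLM call ok in %.3fs | prompt=%r | output=%r",
                    time.perf_counter() - start, prompt, output)
        return output
    return wrapper
```

The same logged prompt/output pairs later double as the raw material for grading outputs and fine-tuning your prompts.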
Implementing Workflows
Let’s say we develop legal-tech software that lawyers use to increase productivity. In our example, we have an LLM-based feature that takes a client’s details from a CRM system and the general description of the case being worked on, and answers the lawyer’s query based on legal precedents.
Let’s see what needs to be done to accomplish that:
- Look up all the client’s details based on a given client ID.
- Look up all the details of the current case being worked on.
- Extract the relevant info from the current case being worked on using LLM, based on the lawyer’s query.
- Combine all the above info into a predefined question template.
- Enrich the context with the numerous legal cases (recall the contextual memory challenge).
- Have the LLM find the legal precedents that best match the current case, client, and lawyer’s query.
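The steps above can be sketched as a pipeline. Every function here is a hypothetical stand-in (fake CRM lookup, fake LLM calls) that shows the shape of the orchestration, not a real implementation:

```python
def get_client(client_id: str) -> dict:
    """Step 1: look up the client's details (stand-in for a CRM call)."""
    return {"id": client_id, "name": "Acme Corp"}

def get_case(case_id: str) -> dict:
    """Step 2: look up the details of the current case."""
    return {"id": case_id, "summary": "Contract dispute over delivery terms."}

def extract_relevant(case: dict, query: str) -> str:
    """Step 3: stand-in for an LLM call extracting the facts relevant to the query."""
    return f"Facts relevant to '{query}': {case['summary']}"

def build_prompt(client: dict, facts: str, precedents: list[str], query: str) -> str:
    """Steps 4-5: fill the question template and enrich it with legal cases.

    With many long precedents this is where the chunking technique from the
    contextual memory section would kick in.
    """
    context = "\n".join(precedents)
    return (f"Client: {client['name']}\nFacts: {facts}\n"
            f"Precedents:\n{context}\nQuery: {query}")

def find_precedents(prompt: str) -> str:
    """Step 6: stand-in for the LLM picking the best-matching precedents."""
    return "Best matching precedent: Smith v. Jones"

def answer_query(client_id: str, case_id: str, query: str, precedents: list[str]) -> str:
    """Orchestrate the whole workflow end to end."""
    client = get_client(client_id)
    case = get_case(case_id)
    facts = extract_relevant(case, query)
    prompt = build_prompt(client, facts, precedents, query)
    return find_precedents(prompt)
```

Even with every step stubbed out, the orchestration alone is non-trivial; with real data sources, retries, caching, and logging layered in, each workflow becomes a substantial piece of code.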
Now, imagine that you have two or more features with such workflows, and try to picture what your code looks like after you implement them. I bet that just thinking about the work to be done here makes you shift uncomfortably in your chair.
For your code to be maintainable and readable, you will need to implement various layers of abstraction and perhaps consider adopting or implementing some sort of workflow management framework, if you foresee more workflows in the future.
And finally, this example brings us to the next challenge:
Strong Code Coupling
Now that you are aware of all the above challenges and the complexities that arise, you may start seeing that some of the tasks that need to be done should not be the developer’s responsibility.
Specifically, all the tasks related to building workflows, testing, fine-tuning, and monitoring the outcomes and external API usage can be done by someone dedicated to those tasks, whose expertise need not be building software. Let’s call this persona the LLM engineer.
There’s no reason the LLM workflows, testing, fine-tuning, and so on should fall under the software developer’s responsibility. Software developers are experts at building software, while LLM engineers should be experts at building and fine-tuning LLM workflows, not at building software.
But with the current frameworks, LLM workflow management is coupled to the codebase. Whoever builds these workflows needs the expertise of both a software developer and an LLM engineer.
There are ways to do the decoupling, such as creating a dedicated microservice that handles all the workflows, but this is yet another challenge that needs to be handled.