Way back in 2019, I published a LinkedIn blog titled Why You Need ML Ops for Successful Innovation. Fast forward to today, and operationalizing analytic, machine learning (ML), and artificial intelligence (AI) models (or rather, systems) is still a challenge for many organizations. That said, technology has evolved, and new companies have been born to help address the challenges of deploying, monitoring, and updating models in production environments. However, with the recent advancement of generative AI built on large language models (LLMs)–such as OpenAI's GPT-4, Google's PaLM 2, Meta's LLaMA, and GitHub Copilot–organizations have raced to understand the value, costs, implementation timelines, and risks associated with LLMs. Companies should proceed with caution: we are just at the beginning of this journey, and I'd say most organizations are not yet prepared for fine-tuning, deploying, monitoring, and maintaining LLMs.
Machine learning operations (aka MLOps) can be defined as:
ML Ops is a cross-functional, collaborative, continuous process that focuses on operationalizing data science by managing statistical, data science, and machine learning models as reusable, highly available software artifacts, via a repeatable deployment process. It encompasses unique management aspects that span model inference, scalability, maintenance, auditing, and governance, as well as the ongoing monitoring of models in production to ensure they are still delivering positive business value as the underlying conditions change.
Now that we have a clear definition of MLOps, let’s discuss why it matters to organizations.
In today’s algorithmic-fueled business environment, the criticality of MLOps cannot be overstated. As organizations rely on increasingly sophisticated ML models to drive day-to-day decision-making and operational efficiency, the need for a robust, scalable, and efficient system to deploy, manage, monitor, and refresh these models becomes paramount. MLOps provides a framework and set of processes for collaboration between data scientists and computer scientists, who develop the models, and IT operations teams, who deploy, manage and maintain them–making sure models are reliable, up-to-date, and delivering business value.
Broadly speaking, MLOps functionally includes automated machine learning workflows, model versioning, model monitoring, and model governance.
● Automated workflows streamline the process of training, validating, and deploying models, reducing manual effort and increasing speed.
● Model versioning allows for tracking changes and maintaining a registry of model iterations.
● Model monitoring is crucial for making sure models are performing as expected in production systems.
● Model governance provides compliance with regulations and organizational policies.
Together, these capabilities enable organizations to operationalize ML and AI at scale, driving business value and competitive advantage for their organizations.
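To make the model versioning capability concrete, here is a minimal sketch in Python of what a registry entry might track. The class and field names are illustrative only; a real implementation would use a purpose-built tool such as MLflow's model registry.

```python
import hashlib
import time

class ModelRegistry:
    """Toy illustration of model versioning: each registered model gets an
    immutable version entry with a content hash and metadata, so any
    deployment can be traced back to the exact artifact that produced it."""

    def __init__(self):
        self.versions = []

    def register(self, name, artifact_bytes, metrics):
        entry = {
            "name": name,
            "version": len(self.versions) + 1,          # monotonically increasing
            "sha256": hashlib.sha256(artifact_bytes).hexdigest(),
            "metrics": metrics,                          # evaluation snapshot at registration
            "registered_at": time.time(),
        }
        self.versions.append(entry)
        return entry["version"]

reg = ModelRegistry()
v1 = reg.register("churn-model", b"fake-serialized-model-v1", {"auc": 0.87})
v2 = reg.register("churn-model", b"fake-serialized-model-v2", {"auc": 0.88})
print(v1, v2)
```

The content hash is the key design choice: it lets auditors verify that the artifact running in production is byte-for-byte the one that was evaluated and approved.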
To ensure that models are performing as expected and delivering optimal predictions in production systems, there are several types of metrics and key performance indicators (KPIs) that are used to track their efficacy. Talk to a data scientist and they will often highlight the following metrics:
● Model Performance Metrics: These are the metrics that measure the predictive performance of a model. They can include accuracy, precision, recall, F1 score, area under the ROC curve (AUC-ROC), mean absolute error (MAE), mean squared error (MSE), etc. The choice of metric depends on the type of problem (classification, regression, etc.) and the business context.
● Data Drift: This measures how much the input data in the production workflow deviates from the data the model was trained on. Significant data drift may indicate that the model’s predictions could become less reliable over time. We saw a great example of this in that little “blip” known as COVID. Consumer habits and business norms changed overnight, causing everyone’s models to break!
● Model Drift: Similar to data drift, this measures how much the model’s performance changes (often degrading) over time rather than measuring how the data distribution is deviating from the norm. This can happen if the underlying data distribution changes, causing the model’s assumptions to become less accurate.
● Prediction Distribution: Tracking the distribution of the model’s predictions can help detect anomalies. For example, if a binary classification model suddenly starts predicting a lot more positives than usual, it could indicate an issue. These often align most closely with business metrics.
● Resource Usage: IT resource usage includes metrics like CPU usage, memory usage, and latency. These metrics are important for ensuring that the model is running efficiently and within the infrastructure and architectural constraints of the system.
● Business Metrics: The most important of all, these metrics measure the impact of the model on business outcomes. They could include revenue, customer churn rates, conversion rates, and, more generically, response rates. These metrics help assess whether the model is delivering the expected business value.
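To ground the first bullet above, here is a minimal pure-Python sketch computing accuracy, precision, recall, and F1 from a batch of predictions. The labels are made-up illustration data; in practice you would use a library such as scikit-learn.

```python
def classification_metrics(y_true, y_pred):
    """Core classification metrics from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # ground-truth labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (illustrative)
print(classification_metrics(y_true, y_pred))
```

As the bullet notes, which of these you optimize for depends on the problem type and the business context; precision matters more when false positives are costly, recall when false negatives are.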
So, now that we have a high-level understanding of MLOps–why it's important, its key capabilities, and its metrics–how does this relate to generative AI?
Prior to generative AI becoming mainstream, organizations had primarily implemented AI systems that acted upon structured and semi-structured data. These systems were trained mostly on numbers and generated numerical outputs–predictions, probabilities, and group assignments (think segmentation and clustering). In other words, we would train our AI models on historical numeric data–transactional, behavioral, demographic, technographic, firmographic, geospatial, and machine-generated data–and output a likelihood to churn, respond, or interact with an offer. That's not to say we didn't make use of text, audio, or video data; we did, for sentiment analysis, equipment maintenance logs, and the like, but these use cases were far less prevalent than numeric-based approaches. Generative AI brings a new set of capabilities that allow organizations to make use of the data they've been essentially ignoring all these years: text, audio, and video.
The uses and applications are many, but I’ve summarized the key cross-functional use cases for generative AI (to date).
Generative AI can generate human-quality content spanning audio, video and images, and text.
● Audio content generation: Generative AI can craft audio tracks suitable for social media platforms like YouTube, or add AI-powered voiceovers to your written content, enhancing the multimedia experience. In fact, my first two TinyTechGuides have voiceovers on Google Play that were completely generated by AI. I could pick the accent, gender, age, tempo, and a few other key attributes for the AI-narrated books. Check out the AI-narrated audiobooks here.
● Text content generation: This is probably the most popular form of generative AI at the moment. From crafting blog posts, social media updates, product descriptions, draft emails, and customer letters to RFP proposals, generative AI can effortlessly produce a wide range of text content, saving businesses significant time and resources. Buyer beware, though: just because the content is generated and sounds authoritative does not mean it is factually accurate.
● Image and video generation: We've seen this slowly maturing in Hollywood, from AI-generated characters in the Star Wars franchise to the de-aging of Harrison Ford in the latest Indiana Jones movie–AI can create realistic images and video. Generative AI can also accelerate creative services by generating content for ads, presentations, and blogs. We've seen companies like Adobe and Canva make a concerted effort on the creative services front.
● Software code generation: Generative AI can generate software code (like Python) and SQL, which can be integrated into analytics and BI systems as well as AI applications themselves. In fact, Microsoft is continuing research on using 'textbooks' to train LLMs to create more accurate software code.
In addition to creating net-new realistic content for companies, generative AI can also be used to summarize and personalize content. Beyond ChatGPT, companies like Writer, Jasper, and Grammarly are targeting marketing functions and organizations for content summarization and personalization. This will allow marketing organizations to spend their time creating a well-thought-out content calendar and process; these services can then be fine-tuned to create a seemingly infinite number of variations of the sanctioned content, delivered to the right person, in the right channel, at the right time.
The third area where generative AI is gaining traction is content discovery and Q&A. From a data and analytics software perspective, various vendors are incorporating generative AI capabilities to create more natural, plain-language interfaces that facilitate the automatic discovery of new datasets within an organization, as well as write queries and formulas against existing datasets. This will allow non-expert business intelligence (BI) users to ask simple questions like, "What are my sales in the northeast region?" and then drill down with increasingly refined questions. The BI and analytics tools automatically generate the relevant charts and graphics based on the query.
We also see increased use of this in the healthcare and legal industries. Within the healthcare sector, generative AI can comb through reams of data to help summarize doctor notes and personalize communications and correspondence with patients via chatbots, email, and the like. There is a reticence to using generative AI solely for diagnostic capabilities, but with a human in the loop, we will see this increase. We will also see generative AI usage grow within the legal profession. Again, in a document-centric industry, generative AI will be able to quickly find key terms within contracts, help with legal research, summarize contracts, and create custom legal documents for lawyers. McKinsey dubbed this the legal copilot.
Now that we understand the primary uses associated with generative AI, let’s turn to key concerns.
Generative AI, while promising, comes with its own set of hurdles and potential pitfalls. Organizations must carefully consider several factors before integrating generative AI technology into their business processes. The main challenges include:
● Accuracy Issues (Hallucinations): LLMs can often generate misleading or entirely false information. These responses may seem credible but are entirely fabricated. What safeguards can businesses establish to detect and prevent this misinformation?
● Bias: Organizations must understand the sources of bias in the model and implement mitigation strategies to control it. What company policies or legal requirements are in place to address potential systematic bias?
● Transparency Deficit: For many applications, particularly in sectors like financial services, insurance, and healthcare, model transparency is often a business requirement. However, LLMs are not inherently explainable or predictable, leading to "hallucinations" and other potential mishaps. If your business needs to satisfy auditors or regulators, you must ask yourself: can we even use LLMs?
● Intellectual Property (IP) Risk: The data used to train many foundational LLMs often includes publicly available information–we've seen litigation over the improper use of images (e.g., HBR — Generative AI Has an Intellectual Property Problem), music (The Verge — AI Drake Just Set an Impossible Legal Trap for Google), and books (LA Times — Sarah Silverman and Other Bestselling Authors Sue Meta and OpenAI for Copyright Infringement). In many cases, the training process indiscriminately absorbs all available data, leading to potential litigation over IP exposure and copyright infringement. This raises the question: what data was used to train your foundation model, and what was used to fine-tune it?
● Cybersecurity and Fraud: With the widespread use of generative AI services, organizations must be prepared for potential misuse by malicious actors. Generative AI can be used to create deep fakes for social engineering attacks. How can your organization ensure that the data used for training has not been tampered with by fraudsters and malicious actors?
● Environmental Impact: Training large-scale AI models requires significant computational resources, which in turn leads to substantial energy consumption. This has implications for the environment, as the energy used often comes from non-renewable sources, contributing to carbon emissions. For organizations who have environmental, social, and governance (ESG) initiatives in place, how will your program account for LLM use?
Now, there are a myriad of other things companies need to consider, but the major ones have been captured. This raises the next question, how do we operationalize generative AI models?
Now that we have a better understanding of generative AI, its key uses, challenges, and considerations, let's turn to how the MLOps framework must evolve. I have dubbed this GenAIOps, and to my knowledge, I am the first to coin the term.
Let’s take a look at the high level process for the creation of LLMs; the graphic was adapted from On the Opportunities and Risks of Foundation Models.
Figure 1.1: Process to Train and Deploy LLMs
In the figure above, we see that data is created, collected, and curated, and that models are then trained, adapted, and deployed. Given this, what considerations should be made for a comprehensive GenAIOps framework?
Recently, Stanford released a paper, Do Foundation Model Providers Comply with the Draft EU AI Act? After reading it, I used it as inspiration to create the GenAIOps Framework Checklist below.
○ What data sources were used to train the model?
○ How was the data that was used to train the model generated?
○ Did the trainers have permission to use the data in this context?
○ Does the data contain copyrighted material?
○ Does the data contain sensitive or confidential information?
○ Does the data contain personally identifiable information (PII)?
○ Has the data been poisoned? Is it subject to poisoning?
○ Was the data genuine or did it include AI-generated content?
○ What limitations does the model have?
○ Are there risks associated with the model?
○ What are model performance benchmarks?
○ Can we recreate the model if we had to?
○ Are the models transparent?
○ What other foundation models were used to create the current model?
○ How much energy and computational resources were used to train the model?
○ Where will the models be deployed?
○ Do the users of the target deployment applications understand that they are using generative AI?
○ Do we have the appropriate documentation to satisfy auditors and regulators?
Now that we have a starting point, let's take a closer look at the metrics.
Using the MLOps metrics and KPIs as a starting point, let’s examine how these may apply to generative AI metrics. We hope that GenAIOps will help address the specific challenges of generative AI, such as the generation of false, fake, misleading, or biased content.
In the context of generative AI, how could an organization measure the performance of the model? I suspect that most organizations will likely use a commercially available pre-trained LLM and use their own data to fine-tune and adapt it.
Now, there are certainly technical performance metrics associated with text-based LLMs like BLEU, ROUGE, or METEOR and there are certainly others for image, audio, and video but I am more concerned with the generation of false, fake, misleading, or biased content. What controls can an organization put in place to monitor, detect, and mitigate these occurrences?
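For intuition on what the text metrics mentioned above actually measure, here is a simplified, toy sketch of BLEU-style unigram precision in pure Python. Real BLEU uses multiple n-gram orders and a brevity penalty, and, as noted, a high overlap score says nothing about whether the generated text is factually accurate.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference,
    with counts clipped to the reference (as in BLEU-1). A crude
    automatic quality signal, not a factual-accuracy check."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "the model was deployed to production on friday"
candidate = "the model was released to production last friday"
print(round(unigram_precision(candidate, reference), 3))  # 0.75
```

Notice that a fluent but entirely fabricated sentence could still score well against a reference on lexical overlap, which is exactly why these metrics alone are insufficient controls for hallucination.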
We’ve certainly seen the proliferation of propaganda in the past and social media giants like Facebook, Google and Twitter have failed to implement a tool that consistently and reliably prevents this from happening. If this is the case, how will your organization measure generative AI model performance? Will you have fact checkers? How about for images, audio, and video? How can you measure the performance of these models?
Given that models take significant resources and time to train, how will model creators determine whether the world's data has drifted enough to warrant a new model? How will an organization know whether its data is evolving to the point where it needs to recalibrate its model? This is relatively straightforward with numeric data, but I think we're still learning how to deal with unstructured data like text, image, audio, and video.
Assuming we can create a mechanism to periodically adjust our models, one should also have a control in place to detect whether drifting data reflects true events or a proliferation of AI-generated content. In my post on AI Entropy: The Vicious Circle of AI-Generated Content, I discussed the fact that when you train AI on AI-generated content, it becomes dumber over time.
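For the numeric case, where I said drift detection is relatively straightforward, one common control is the Population Stability Index (PSI). The sketch below is a toy pure-Python version with made-up data; the 0.1 / 0.25 thresholds are conventional rules of thumb, not universal constants.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def frac(sample, b):
        left = lo + b * width
        right = left + width
        # include the right edge in the last bin; floor at 1e-6 to avoid log(0)
        n = sum(1 for x in sample
                if left <= x < right or (b == bins - 1 and x == hi))
        return max(n / len(sample), 1e-6)

    return sum((frac(expected, b) - frac(actual, b))
               * math.log(frac(expected, b) / frac(actual, b))
               for b in range(bins))

train = [0.1 * i for i in range(100)]    # stand-in for a training feature
prod_same = list(train)                  # production data with no drift
prod_shifted = [x + 5.0 for x in train]  # production data after a shift

print(round(psi(train, prod_same), 4))
print(psi(train, prod_shifted) > 0.25)   # major drift detected
```

For text, image, audio, and video, a rough analogue is to run the same comparison over embedding distances rather than raw values, but as I said above, that practice is still maturing.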
Similar to your model performance and data drift concerns, how will your organization detect and understand whether the performance of your model starts to drift? Will you have human monitors of the output, or send surveys to the end users? Perhaps one of the more straightforward approaches is not only to put controls in place to monitor the technical performance of a model, but also to always track the model outputs. This goes without saying, but you're using a model to solve a specific business challenge, and you need to monitor the business metrics. Are you seeing an increase in cart abandonment, an increase or decrease in customer service calls, or a change in customer satisfaction ratings?
Again, I think we have decent tools and techniques to track this for numeric-based predictions. But now that we are dealing with text, image, audio, and video, how do you think about monitoring prediction distributions? Will we be able to understand if the model output at its deployment target is generating spurious correlations? If so, what can you put in place to measure this phenomenon?
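For the numeric, classification-style outputs where we do have decent tools, one simple prediction-distribution control is a two-proportion z-test comparing the live positive-prediction rate against a baseline window. The counts below are invented for illustration, and the alert threshold is an assumption you would tune to your tolerance for false alarms.

```python
import math

def positive_rate_alert(base_pos, base_n, cur_pos, cur_n, z_thresh=3.0):
    """Flag when the current positive-prediction rate deviates from the
    baseline rate by more than z_thresh pooled standard errors."""
    p_base = base_pos / base_n
    p_cur = cur_pos / cur_n
    pooled = (base_pos + cur_pos) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    z = (p_cur - p_base) / se if se else 0.0
    return abs(z) > z_thresh, z

# Baseline: 200 positives in 10,000 predictions (2%).
# Today: 95 positives in 2,000 predictions (4.75%) -- should alert.
alert, z = positive_rate_alert(200, 10_000, 95, 2_000)
print(alert)
```

For generative outputs there is no single "positive rate" to track, but the same idea applies to any scalar you can extract per response: length, toxicity score, refusal rate, or topic mix.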
On the surface, this one seems relatively straightforward. However, as generative AI usage grows within a company, your organization will need to have a system in place to track and govern its usage. Pricing models are still evolving in the generative AI segment, so we need to be careful here. Similar to what we are seeing in the cloud data warehouse space, we are beginning to see costs spiral out of control. So, if your company has usage-based pricing, how will you put financial controls and governance mechanisms in place to make sure your costs are predictable and do not run away on you?
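As an illustration only, a usage guardrail can be as simple as metering token spend against a budget cap before each call goes out. The class, rate, and prices below are hypothetical placeholders, not any vendor's actual rate card; substitute your provider's pricing.

```python
class LlmBudgetGuard:
    """Minimal usage-governance sketch: meter token spend against a
    monthly budget and refuse calls once the cap would be exceeded.
    The per-token rate is a made-up placeholder."""

    def __init__(self, monthly_budget_usd, usd_per_1k_tokens=0.002):
        self.budget = monthly_budget_usd
        self.rate = usd_per_1k_tokens
        self.spent = 0.0

    def charge(self, tokens):
        cost = tokens / 1000 * self.rate
        if self.spent + cost > self.budget:
            raise RuntimeError("LLM budget exhausted -- request blocked")
        self.spent += cost
        return cost

guard = LlmBudgetGuard(monthly_budget_usd=100.0)
guard.charge(50_000)          # a 50k-token request
print(round(guard.spent, 2))
```

A production version would also attribute spend per team or application, which is what makes chargeback and governance conversations possible later.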
I’ve made this point previously, but the most important set of monitors and controls you can put in place are related to your business metrics. Your company needs to be ever vigilant in monitoring how your models are actually impacting your business on a day-to-day basis. If you’re using them for critical business processes, what SLA guarantees do you have in place to ensure uptime?
Bias is a big concern with any AI model, but this may be even more acute with generative AI. How will you detect if your model outputs are biased and if they are perpetuating inequalities? There was a great blog on this from Tim O’Reilly titled We Have Already Let the Genie Out of the Bottle, which I encourage you to read.
From an intellectual property perspective, how will you guarantee that proprietary, sensitive, or personal information is not escaping or leaking from your organization? Given all of the litigation on copyright infringement going on now, this is an important set of factors that your organization will need to wrestle with. Should you ask vendors to guarantee that these are not in your model, similar to Adobe’s play (FastCompany — Adobe is so confident its Firefly generative AI won’t breach copyright that it’ll cover your legal bills)? Now, it’s nice that they will cover your legal bills, but what reputational risk does this expose your company to? If you lose the trust of your customers, you may never get them back.
Lastly, data poisoning is certainly a hot topic. When you use your organization’s data to adapt and fine-tune a model, how can you guarantee that the data is not toxic? How can you guarantee that the data used to train the foundation models was not poisoned?
In the end, the goal of this piece was not to provide specific methods and metrics for addressing GenAIOps, but rather to pose a series of questions about what organizations need to consider before implementing an LLM. As with anything, generative AI has great potential to help your organization achieve a competitive advantage, but it also comes with a set of challenges and risks that need to be addressed. Ultimately, GenAIOps will need a set of principles and capabilities that span both the adopting organization and the vendor providing the LLM. In the words of Spider-Man: with great power comes great responsibility.
If you want to learn more about Artificial Intelligence, check out my book Artificial Intelligence: An Executive Guide to Make AI Work for Your Business on Amazon.