Model deployment is tricky: with cloud platforms continuously changing and AI-related libraries updating almost weekly, maintaining backward compatibility and finding the right deployment method is a big challenge. In today’s blog post, we will see how to deploy a TFLite model on the Google Cloud Platform in a serverless fashion.
This blog post is structured in the following way:
- Understanding Serverless and other ways of Deployment
- What is Quantization and TFLite?
- Deploying a TFLite model using the GCP Cloud Run API
Let’s first understand what we mean by serverless, because serverless doesn’t mean without a server.
An AI model, or any application for that matter, can be deployed in several different ways, falling under three major categories.
Serverless: In this case, the model is stored in the cloud container registry and runs only when a user makes a request. When a request comes in, a server instance is automatically launched to fulfill it, and it shuts down again after a period of inactivity. Starting, configuring, scaling, and shutting down instances are all handled by the Cloud Run API provided by the Google Cloud Platform; AWS Lambda and Azure Functions are the equivalents on other clouds.
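To make this concrete, here is a minimal sketch of what such a Cloud Run service could look like. This is illustrative only, not the deployment code we will build later: the model file name `model.tflite`, the JSON request format, and the choice of Flask with the `tflite_runtime` package are all assumptions for the sketch.

```python
# Minimal sketch of a TFLite inference service for Cloud Run.
# Assumptions: a "model.tflite" file baked into the container, and
# requests shaped like {"instances": [[...], ...]} matching the model input.
import os

import numpy as np
import tflite_runtime.interpreter as tflite
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at startup so every request on this instance
# reuses the same interpreter instead of reloading the model.
interpreter = tflite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()


@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body into a float32 array with the expected shape.
    data = np.array(request.get_json()["instances"], dtype=np.float32)
    interpreter.set_tensor(input_details[0]["index"], data)
    interpreter.invoke()
    output = interpreter.get_tensor(output_details[0]["index"])
    return jsonify({"predictions": output.tolist()})


if __name__ == "__main__":
    # Cloud Run tells the container which port to listen on via PORT.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))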
Serverless has its own advantages and disadvantages.
- The biggest advantage is cost-saving: if you don’t have a large user base, the server sits idle most of the time, and your money goes to waste. Another advantage is that we don’t need to think about scaling the infrastructure; depending on the load, the platform automatically adjusts the number of instances to handle the traffic.
- In the disadvantage column, there are three things to consider. It has a small payload limit, meaning it cannot be used to run bigger models. Secondly, the server automatically shuts down after 15 min of idle time, so when we make a request after a long gap, the first request takes much…