OpenVINO Unveiled: Streamlining Model Serving
If you are just like me and are hearing about OpenVINO for the first time -- you have come to the right place.
MLOps is a multifaceted field, with a lot of specificity around the kinds of workloads we face.
In one of my previous posts, I talked about a simple model platform for Kubernetes-native projects. That platform was optimised for multi-skill teams and had model-chaining and the self-sufficiency of team members as its primary considerations. Model-chaining, in other words, is the problem of sequential model hosting (models are expected to execute a workload in a chained fashion).
Today we are going to talk about the problem of parallel model hosting -- that is, how we can host multiple independent models in a cost-efficient manner. We shall start with a typical use case, cover at a high level what OpenVINO is, and finish by fitting it into a production-grade architecture.
Exercise
To be more specific, imagine an exercise of building an image-styling application.
There has been a rise of such applications in the past few years, so hopefully that makes it more interesting to study.
Our application is expected to prompt the user to upload an image and to return a styled version as its output. We would like to offer multiple styles -- and this is where we get the multiple-model hosting challenge.
If you would like to imagine a business application -- it could range from a freemium app limiting the number of free image stylings to a more specialised one (like face/background swapping).
For those who are curious about a hands-on view, you can find more details on my GitHub: https://github.com/ilkadi/image-styler/tree/main
There you'll find a simple docker-compose-managed three-tier app: a TypeScript+React frontend, a Python API, and an OpenVINO server with models.
OpenVINO
I haven't answered my dear reader's question yet, which is -- what is OpenVINO?
Here is ChatGPT's summary:
OpenVINO is a versatile toolkit that simplifies and accelerates the deployment of artificial intelligence and machine learning models across various hardware platforms, making it easier to integrate AI into real-world applications.
From the pragmatic MLOps perspective of this post, here is a picture which might be worth a thousand words:
What we see in this picture is the model structure on a shared disk, as expected by the OpenVINO server.
At first glance, we can tell that it allows serving multiple models, supports model versioning and accepts several model formats. In other words -- a pretty awesome way to deploy models in parallel and to update them via a built-in versioning mechanism.
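Spelled out as a directory tree, the layout looks roughly like this (the style names here are hypothetical): each model gets its own directory, with numeric version subdirectories containing either the IR pair or a single ONNX file.

```
models/
├── style_mosaic/            # model name used by clients in requests
│   ├── 1/                   # numeric version directories
│   │   └── model.onnx       # ONNX is served directly
│   └── 2/
│       └── model.onnx
└── style_candy/
    └── 1/
        ├── model.xml        # OpenVINO IR: network topology...
        └── model.bin        # ...and its weights
```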
As you have probably noticed, OpenVINO is heavily focused on model-serving performance, but I couldn't find out-of-the-box tools supporting model decay monitoring. Something to keep in mind.
Taking a closer look, there are certain limitations on the types of models that can be hosted on the OpenVINO server. In particular, there is a native format (OpenVINO IR) which is optimal for execution -- other formats are either converted in flight or not supported.
Actually, I tried to use a TensorFlow model, but it wasn't working out of the box. Conversion to the IR format didn't go smoothly either -- so I opted for ONNX-format models for the exercise above. Hence, from a practical perspective, it is important to make sure that the supported technologies match the problem you are solving.
A big advantage of OpenVINO is its performance focus and support for multi-node model serving. As it is Intel's baby, OpenVINO exists to support Intel's ecosystem of hardware across a variety of compute capabilities, ranging from IoT devices to GPUs and VPUs. It follows that we can be pretty sure of the platform's support and development -- for as long as Intel is in business, at least.
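For completeness, here is a minimal sketch of such a conversion, assuming a recent (2023+) openvino Python package and a hypothetical ONNX file name; older releases expose the same step through the Model Optimizer (mo) CLI instead.

```python
# Minimal sketch: convert an ONNX model into OpenVINO IR (assumes openvino >= 2023.0).
import openvino as ov

# Read the ONNX graph into an in-memory OpenVINO model
ov_model = ov.convert_model("style_transfer.onnx")  # hypothetical file name

# Serialise the IR pair (model.xml + model.bin) into the versioned layout shown above
ov.save_model(ov_model, "models/style_transfer/1/model.xml")
```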
Architectures
While for the simplicity of the exercise I used docker-compose and synchronous REST API calls, that is obviously not a production-grade solution. Let's think about what it would take to use OVMS (OpenVINO Model Server) in production.
Models require an inference layer before data is passed to them. This layer can be understood as a conversion step, turning whatever the input is into the very precise format expected by a given model. In the case of images, that typically means resizing the picture and packing the data into the tensor layout the model expects, among other things.
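As an illustration, here is a minimal preprocessing sketch for an image model, assuming the model expects a 224x224 RGB image in NCHW float32 layout -- the exact shape, layout and normalisation come from the specific model's metadata.

```python
# Minimal sketch of the conversion layer for an image model (shape/layout are assumptions).
import numpy as np
from PIL import Image

def preprocess(image_path: str, size: tuple = (224, 224)) -> np.ndarray:
    """Turn an uploaded image into the tensor the model expects."""
    img = Image.open(image_path).convert("RGB").resize(size)
    tensor = np.asarray(img, dtype=np.float32)   # HWC, values 0..255
    tensor = np.transpose(tensor, (2, 0, 1))     # CHW
    return tensor[np.newaxis, ...]               # add the batch dimension -> NCHW
```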
Naturally, we end up with a pairing of backend servers handling inference and a tier of OVMS servers running our models. OVMS prefers batched requests -- multiple requests are passed to the specialised compute units at once, so less time is wasted on loading/unloading data to those dedicated units (like GPUs).
As mentioned earlier, we want to keep track of uploaded/created pictures -- to collect more data for model training and to be able to set up some sort of model decay monitoring. In the context of our exercise, cloud buckets are a natural choice, but other applications may call for other solutions, ranging from wide-column databases (if we need fast response times) to data warehouses (if all we plan to do is analytics).
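To make the pairing concrete, here is a minimal client-side sketch using the ovmsclient package over gRPC and the preprocess helper sketched above; the port, input name, model name and file names are assumptions that must match the server configuration.

```python
# Minimal sketch: send a (possibly batched) tensor to OVMS over gRPC.
import numpy as np
from ovmsclient import make_grpc_client

client = make_grpc_client("ovms:9000")   # host:port of the OVMS gRPC endpoint

# Stack several preprocessed NCHW images along the batch axis
batch = np.concatenate([preprocess("cat.jpg"), preprocess("dog.jpg")], axis=0)

# Input tensor name and model name must match what the served model exposes
styled = client.predict(inputs={"input": batch}, model_name="style_mosaic")
```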
With the use of queues (SQS, PubSub, RabbitMQ, Kafka) to glue the whole thing together, two options naturally pop up:
- The first option is to use a push-based queue and to make a 1-1 relation between a call to a backend server and a call to an OVMS server. This way the backend can be scaled proportionally to user requests, but we have no way to optimise requests to OVMS into batches. At a bigger scale it would mean that we need to provision more OVMS nodes for the same amount of work. It won't make any difference, however, if the expected number of requests can be handled by a single node anyway -- the case for quite a few business applications. As this approach does not require an implementation of batch handling, it is simpler and is probably the first thing to do.
- Once the workload increases (which I wish you with all my heart), we can implement batch handling on the backend side. Backend servers now need to check the queue on a schedule, pull requests from it for a certain period of time, pack them into a batch and pass that batch to OVMS (see the sketch after this list). The queue-pressure mechanism should work the same way as in option (1); the queue choice/setup might need to be adjusted (PubSub has exactly these two modes, so it should be easy in this case) -- and we should see a decrease in the number of OVMS nodes needed for the same workload.
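Here is a rough sketch of what such a pulling-and-batching worker could look like; the queue client and its pull/reply methods are hypothetical stand-ins for whichever queue you pick.

```python
# Rough sketch of option (2): pull requests for a short window, run them as one OVMS call.
import time
import numpy as np
from ovmsclient import make_grpc_client

client = make_grpc_client("ovms:9000")

def serve_batches(queue, window_seconds=0.2, max_batch=16):
    while True:
        deadline = time.time() + window_seconds
        requests = []
        # Collect requests until the window closes or the batch is full
        while time.time() < deadline and len(requests) < max_batch:
            msg = queue.pull(timeout=max(0.0, deadline - time.time()))  # hypothetical queue API
            if msg is not None:
                requests.append(msg)
        if not requests:
            continue
        batch = np.concatenate([r.tensor for r in requests], axis=0)    # stack NCHW inputs
        styled = client.predict(inputs={"input": batch}, model_name="style_mosaic")
        # Fan the results back out to the original callers
        for request, output in zip(requests, styled):
            request.reply(output)                                       # hypothetical ack/response
```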
Summary
What I like about OpenVINO is that it does not try to drag you into an end-to-end ecosystem of ML/AI tools. It takes care of one thing -- hosting -- and it specifies the model formats it expects for that hosting. While the IR format conversion might be limiting, it makes sense that high-performance applications have to satisfy a specific interface -- otherwise precious compute has to deal with vagueness in the input, which means performance loss.
It is not difficult to imagine a CD pipeline taking an ML model from a model registry, running a conversion-to-IR script and uploading the result to the expected place (provided that the conversion is possible). Therefore, OpenVINO fits the model-hosting box well and is worthy of your consideration.
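As a closing sketch of that idea -- with the registry and bucket clients as hypothetical stand-ins for whatever your stack uses (MLflow, an artifact store, a cloud bucket) -- the pipeline step could look roughly like this.

```python
# Rough sketch of a CD step: fetch a model, convert it to IR, publish it for OVMS to pick up.
import os
import openvino as ov

def publish_model(registry, bucket, model_name: str, version: int):
    onnx_path = registry.download(model_name, version)       # hypothetical registry client
    ov_model = ov.convert_model(onnx_path)
    out_dir = f"/tmp/{model_name}"
    os.makedirs(out_dir, exist_ok=True)
    ov.save_model(ov_model, f"{out_dir}/model.xml")          # writes model.xml + model.bin
    # Upload the IR pair into the versioned layout the server watches
    for ext in ("xml", "bin"):
        bucket.upload(f"{out_dir}/model.{ext}",
                      f"models/{model_name}/{version}/model.{ext}")  # hypothetical bucket client
```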