Neat alternative to Kubeflow for the poor
End-to-end Machine Learning Lifecycle: (https://ml-ops.org/content/end-to-end-ml-workflow)
Any organisation stepping into the ML/AI space sooner or later realises that it is a very different beast compared to plain software. The teams working on it encounter hindrances which seem counterintuitive at first glance; data scientists quickly disappear into their silo and keep complaining about things not working because of... whatever it is.
What's even more frustrating is that the textbook examples don't work exactly as expected -- cloud-native solutions look nice in a video course, but once you attempt to apply them they run into tons of incompatibilities with the organisation's de-facto processes and tech stack. This makes deployment overly complicated and makes it difficult for software developers to follow and apply their perspective. Last, but probably most important -- in practice the complete data-driven end-to-end model lifecycle is rarely achieved, often resulting in blind spots and bottlenecks that affect final product quality.
Very small organisations tend to go all-in on the cloud provider's tooling -- like GCP's Vertex, AWS's SageMaker or Azure ML -- or might choose something like Databricks if the company knows upfront that ML is its priority. The hidden trade-off of those choices is often unified security -- or rather the difficulty of organising the company's teamwork in such a way that the people responsible for product security have space to do their job without hindering data science. Obviously there is also platform lock-in, capping the organisation to the speed of innovation and the market strategy of its provider. But there is one more cost to such a decision -- the skew towards the tool choice diverging from the business goal (compare the more familiar "build an application for vertical farming" to "build an application for vertical farming using only Java Spring Boot"; the same happens for ML, it is just less visible).
For those reasons Kubeflow often becomes the tool of choice for larger enterprises, provided they have people with the skills to set it up and maintain it well in the organisation's Kubernetes cluster. That choice, however, also has multiple implications: it creates a certain duplication in the way software and model containers are built and monitored, which increases the team's cognitive load and pretty much implies a larger headcount. Data and model decay monitoring still needs to be implemented, and it is actually not that simple -- to the extent that new companies build a business around solving just that. Even worse, the smell of "this is complicated" mentally puts ML/AI teams into a silo within the organisation, simply because the differences become so huge. And silos have a lot of negative impacts on the business, further complicating things.
Today we shall talk about a much simpler approach that implements the full end-to-end model lifecycle while keeping the team's cognitive load to the minimum.
Problem Statement
Naturally, before talking about solutions we want to understand what it is we are solving.
Firstly, let's have a look at the essential steps which any enterprise-grade ML/AI project has to cover:
- we certainly need a way to build the prototype,
- we need code which executes the model and bridges it with other systems,
- we need to somehow package it and have processes to deploy it;
- once we have tested the prototype and proved its viability -- we need a retraining tool which allows us to train models with various hyperparameters/frameworks in one go,
- the training process has to depend on some sort of versioned dataset (just as written software depends on versions of libraries, models depend on the data they are trained on),
- trained model binaries need to be stored in some sort of registry with mandatory performance evaluations (unlike a code package, which can be stored with just a version),
- for most models we are going to need drill-down analytics tooling to help us make sense of the physical causes behind the accuracy figures (in any business, errors in some edge cases are more expensive than in others, and the final accuracy figure tends to hide that),
- we need a process to integration-test our models/model chains as part of a product (model chains, like those in NLP, tend to have their accuracy affected in a non-linear way by their modules),
- we need to be able to automatically compare the data the model encounters in production to the data the model was trained/tested on (data drift detection, data poisoning detection, etc.) -- a minimal sketch of such a check follows this list.
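As an illustration of that last point, here is a minimal sketch of what such an automated comparison could look like, assuming SciPy-style two-sample tests over exported feature samples; the feature names and alert threshold are purely illustrative.

```python
# Minimal per-feature data drift check (a sketch, not production code).
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_features: np.ndarray, prod_features: np.ndarray,
                 names: list[str], alpha: float = 0.01) -> list[str]:
    """Return names of features whose production distribution differs
    significantly from the training distribution (two-sample KS test)."""
    drifted = []
    for i, name in enumerate(names):
        _, p_value = ks_2samp(train_features[:, i], prod_features[:, i])
        if p_value < alpha:
            drifted.append(name)
    return drifted

# Usage: compare a sample exported from production to the training snapshot.
# alerts = detect_drift(train_X, prod_X, ["temperature", "humidity", "ph"])
```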
Secondly, let's have a look at the essential skills of the people who are going to solve it as a team:
- Software Developers
  - Skilled at big-scale multi-layer software
  - Skilled at app-level security and testing
  - Know execution-level monitoring
  - Know CI/CD processes
- DevOps
  - Skilled at infrastructure-level code and CI/CD
  - Skilled at infrastructure-level security and testing
  - Know infrastructure-level monitoring
- Data Engineers
  - Skilled at ETL pipelines
  - Skilled at data storages and the trade-offs between them
- Data Scientists
  - Skilled at applying ML/AI technology
  - Know coding and the basic tooling around it (like version control)
- Data Analysts
  - Skilled at building BI dashboards
  - Skilled at observing data patterns and communicating them
- Quality Assurance
  - Skilled at building processes that assure quality
  - Skilled at the principles behind product quality
  - Knows developer tooling
- Agile Coaches
  - Skilled at sensing tension and finding the underlying cause
  - Skilled at aiding and building common ground
- (Optional) Domain SMEs
  - Skilled at whatever the ML/AI is supposed to automate
  - Not familiar with IT and its ways
Thirdly, we consider the feasibility of the cognitive load we place on each individual versus what they are skilled to do with good quality. I propose to keep this in the back of our minds as a criterion; as a warm-up, here are a few examples:
Data scientists can write a secure API, but it is outside their direct skillset. If we expect them to do it, it will cost more time and the result will be less scalable and less secure. On top of that, they will have less time to stay up to date with developments in their field (whitepapers etc.), and the development of the ML/AI itself -- something nobody else is skilled to do -- will stagnate.
Software developers can write IaC, but for most of them it would be close to impossible to stay up to date with its best practices. It means there would be a steady accumulation of design and implementation technical debt. Importantly, developers are not skilled at infrastructure-level security and are likely to eventually use something in their solution without knowing it is insecure. Finally, it costs more developer-hours than DevOps-hours to do the same thing.
Software developers and data scientists can be taught to follow TDD, but as focused roles they may lack the cross-domain perspective to see quality threats appearing at the boundaries between the outputs of their respective work. And if software developers are expected to help data scientists with their code quality, that software developer time is not being applied to something else.
Software developers and data engineers can build model decay monitoring dashboards, but those won't be as good (or as team-power multiplying) as dashboards built by data analysts.
The failure of organisations to recognise the implications of cognitive load on the team is behind the remarkable 70-80% failure rate of ML/AI projects. Indeed, hire a data scientist and a developer and you can easily build a prototype to show off on the market -- and of course the prototype is going to solve something in the demo. But once we want to turn the prototype into a product: (1) somehow it is never quite enough, and (2) the team needs to be scaled up drastically in order to succeed.
Low cognitive load Model Platform
Hopefully by this point I have sold you on the many steps which a successful model platform needs to satisfy. Even more importantly, I hope you see the importance of teamwork and team-role alignment for being able to gradually and predictably scale your project. Please make no mistake -- no vendor can resolve the problem statement for you, unless your project is just to wrap what vendors are doing.
So we see that it is more crucial that people know what they are doing than to have a sophisticated technology stack with a high level of automation. In other words, the level of process automation should match the cognitive ability of the team to maintain and evolve it. Beyond that point, the project starts to suffer various difficult-to-pin-down setbacks which are incomparably more expensive than the amount of time the automation was supposed to save. Let's translate this into guidelines for what we expect from the MLOps system; from my perspective it is simple common sense:
- Each team role is self-sufficient within its zone of responsibility
- Each team role can deploy verified work to production by following the respective process
- Duplication of processes (the same core process solved in different ways across roles) is minimised or removed
- Each role can scale the number of individuals without introducing conflicts (each role can work in parallel)
- No role is responsible (directly or indirectly) for something it is not skilled at
- [Assumption] Kubernetes-like underlying platform is used by the organisation
With those in mind, let's have a look at the simplest possible way to solve our MLOps problem.
Model Ambassador
In DevOps, the Ambassador pattern offloads part of an application's responsibility into a co-located, externally facing, dedicated application. For MLOps it means the following: (1) the model container can be insecure, (2) the model container can use any language/framework, (3) the data export functionality required for model decay monitoring can be offloaded into the ambassador application. You can find the bare-bones, open-source Java/Spring Boot model ambassador source code on GitHub.
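To make the idea concrete, here is a minimal Python/FastAPI sketch of such an ambassador (the reference implementation linked above is Java/Spring Boot); the model container endpoint and the export hook are hypothetical.

```python
# Minimal ambassador sketch, assuming FastAPI and httpx are available.
import json
import httpx
from fastapi import FastAPI, Request

MODEL_URL = "http://localhost:8501/predict"  # co-located model container (assumed endpoint)

app = FastAPI()

@app.post("/predict")
async def predict(request: Request) -> dict:
    payload = await request.json()

    # Forward the request to the (possibly insecure, any-framework) model container.
    async with httpx.AsyncClient() as client:
        response = await client.post(MODEL_URL, json=payload)
    prediction = response.json()

    # Export the request/response pair for model decay monitoring,
    # keeping this responsibility out of the model container itself.
    export_for_monitoring(payload, prediction)
    return prediction

def export_for_monitoring(features: dict, prediction: dict) -> None:
    # Placeholder: in practice this would push to a queue or a data lake sink.
    print(json.dumps({"features": features, "prediction": prediction}))
```

Application-level concerns (auth, rate limiting, input validation) live in this layer, which is exactly what the software developer role is skilled at.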
Let's test the idea against our criteria so far. The pattern offloads data export from the data science role to the data/software developer roles, which is very good. The pattern offloads application-level security and infrastructure-level security from data science to the software and DevOps roles respectively, which is awesome.
From the architecture perspective, we can treat it as a "normal" service -- we can calibrate how "hot" our responses should be, and we can opt for event-driven or on-demand hosting methods. All of our tooling for infrastructure- and application-level logs works out of the box, and of course autoscaling works just as well.
For model chains (like those needed for NLP or any other process involving more than one model), we can use a topic on the distributed queue of your choice to link multiple ambassadors, without bothering data scientists with implementation details. So far so good.
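As a hypothetical illustration of such chaining, the sketch below links two ambassador stages via Kafka topics using the kafka-python client; the topic names and the call_model() helper are illustrative, not part of the reference code.

```python
# Chaining two ambassador stages over Kafka topics (illustrative sketch).
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "sentences",                                   # output topic of the previous stage
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def call_model(payload: dict) -> dict:
    """Forward the payload to the co-located model container (as in the ambassador above)."""
    raise NotImplementedError  # e.g. an HTTP call to the model's /predict endpoint

for message in consumer:
    result = call_model(message.value)
    producer.send("tokens", value=result)          # input topic of the next stage
```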
Phases of model development
Practical details always vary, but for enterprises that have built more than one model we can commonly distinguish several phases of model development. The first is building a prototype, typically in a cloud-hosted Jupyter notebook of some shape or form; the resulting trained binary needs to be stored somewhere. In the very early stages this is, unfortunately, plain cloud storage. Enterprise-grade projects should have a model registry of some sort if they are serious: MLflow is free to use, so it is given as an example above, but it can easily be replaced with much better alternatives like Neptune or Weights & Biases.
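For illustration, a prototype run could end with something like the following sketch, assuming the MLflow registry named above; the experiment name, dataset and registered model name are illustrative.

```python
# Sketch: store the trained prototype binary in a model registry (MLflow here).
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)   # stand-in for the real corpus
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("vertical-farming-prototype")          # hypothetical experiment name

with mlflow.start_run():
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="crop-health-classifier",      # hypothetical registry entry
    )
```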
Model registries are something of an uncharted space nowadays: there are few platforms, and each tries to vendor-lock users by providing an ecosystem that does more than one specific thing. The reality is that the problem they solve is super simple -- keep track of metadata for relatively big binaries (= models). This is the direct responsibility of a model registry, just like container registries do the same for Docker containers. However, most of the platforms (with the exception of Neptune) try to step into your model space to become a middleman for some other aspect -- as it is difficult to make money from simply storing binaries. Anyway, I digress.
Once we have a model binary stored in the model registry, we can put this binary into a container with a gentle splash of well-tested glue code to run it. This is what I call the second stage of model development: the well-tested glue code, the CI running tests, building the container and running scans don't need to change much later, for post-prototype models. The completion of the second stage means that we can reliably and cheaply build containers with the respective model binary pulled from the model registry and the glue code executing it.
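A sketch of that glue code, assuming the MLflow registry from the previous snippet, could look like this; the model name and version are illustrative and would normally come from a version file baked in at build time.

```python
# Glue code inside the model container: pull the binary by name/version and run it.
import mlflow.pyfunc
import pandas as pd

MODEL_NAME = "crop-health-classifier"   # hypothetical registered model
MODEL_VERSION = "3"                     # pinned version, never "latest"

# Pull the binary from the registry by name/version at startup (or at build time).
model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}/{MODEL_VERSION}")

def predict(features: dict) -> list:
    """Thin, well-tested wrapper the ambassador calls into."""
    return model.predict(pd.DataFrame([features])).tolist()
```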
The third stage is a sort of "productionising" of the code used to build the prototype -- it just needs to move from the Jupyter notebook to a dedicated home. The so-far-unspoken assumption was that our Jupyter setup lets us train models on various nodes (CPU, GPU, TPU...). So how do we go about the third phase? It is to ensure there is a CI process with a job that trains a model (or multiple competing models), evaluates it against the test set and exports the binary (or binaries) to the registry. Arguably, the analytics export part and the setup of CI job runners with GPUs are the only work needed at this stage.
Let's test this conceptual setup against our criteria. Model registry setup and maintenance clearly falls under the DevOps role, so the cognitive load for data scientists is not affected. Exporting model binaries with metadata at the end of training is well within the data science role's skillset. Building the container with glue code will require DevOps/software developers to build the mock setup first, but after that the data scientist is self-sufficient. Deployment by data scientists somehow needs to be made simple enough, but the ambassador pattern already promises that it can be done. Model training will also require some cross-role work at the initial stage, but should be fine later. This one is a little bit itchy, but passable.
How shall we deploy
Looking at the high-level process setup, we found that for the data science role to maintain self-sufficiency, the simplicity of the deployment process is crucial.
Fortunately, the combination of Kubernetes and GitOps offers many solutions satisfying our expectations. Kubernetes charts, combined with a GitOps tool like Flux, allow the team to have the same pull-request deployment process for all roles, while requiring data science to be responsible only for the tag of the model container image.
In other words, the DevOps role sets up the chart templates, the software developer role sets up the ambassador application, and the data science role specifies which model container will run there. Once everything is set, everybody can work in parallel and Git takes care of potential work conflicts -- awesome, as we don't need to introduce cognitive load with new types of tooling.
Even nicer, we can set up Jupyter notebooks for our data scientists in a separate cluster/namespace, so that they can deploy production models and select notebooks using the same interface. Even better -- containers with models can be attached as side-car containers to Jupyter notebooks to be tested manually before deployment if needed; or, if your processes are that mature, added as a side-car to a model-tester application. The key thing is that all of these use the same deployment interface: update the tag, raise a PR, pass the CI pipeline, get approval, merge. Note the cognitive load reference here -- we keep the variety of tooling and types of processes to the bare minimum.
So the solution on the diagram kills two birds with one stone: the same process is used for hosting the prototyping tooling and for deploying models to the dev/staging/production environments. The bones of both need to be set up and maintained by DevOps, but the data scientist has a clear, step-by-step tunnel to follow. The proper roles are responsible for security. Awesome.
Back to Phase I: building a model prototype
With this self-hosted tooling we can now implement Phase I of the model lifecycle, with a trained binary prototype as the result. In practice you will almost certainly want some internal tooling library for repeatable code, and a dedicated code-tracking repository integrated with your Jupyter notebooks.
Phase II: building a container with the model and deploying it
As an IT organisation you are certainly using some sort of Git/CI tooling provider -- let's say it is GitLab. You also almost certainly use containers (or you should!) and some sort of container registry. We are simply going to use this tooling, as we can safely assume that software developers/DevOps can set up a template data scientists can follow. Please see the example template for a Python model container-builder on my GitHub. The idea is that: (1) the model registry provides a way to download a model binary by name/version, (2) the boilerplate code to run the model is written by the data scientist and is well tested, (3) the CI pipeline downloads the model binary by the version referenced in a specific file and embeds it into the container, (4) the final ambassador-ready container gets stored in the container registry.
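The CI download step (point 3) could look roughly like the sketch below, assuming the MLflow registry used earlier; the model.version file name and its layout are illustrative, not the actual template.

```python
# CI step: pin and download the model binary before the image build.
import pathlib
import mlflow.artifacts

# (3) The version is referenced in a file committed to the repo, e.g. "model.version"
#     containing a single line such as "crop-health-classifier:3".
name, version = pathlib.Path("model.version").read_text().strip().split(":")

# (1) The registry provides download by name/version; the artifacts land in ./model
#     and are then COPY'd into the container by the Dockerfile.
local_path = mlflow.artifacts.download_artifacts(
    artifact_uri=f"models:/{name}/{version}",
    dst_path="model",
)
print(f"Model {name} v{version} downloaded to {local_path}")
```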
The resulting container with the model can be attached to a dedicated model-tester (an ambassador-like application executing a business suite of tests) or as a side-car to a Jupyter notebook. Later it can be deployed to staging. The data science role is pretty self-sufficient so far and does not need to do anything it is not skilled to do. So far so good.
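A model-tester can be as simple as a test suite calling the container's endpoint; here is a hypothetical pytest-style sketch where the endpoint, payloads and threshold are all illustrative.

```python
# Hypothetical "business suite" test against a deployed model container.
import requests

MODEL_ENDPOINT = "http://model-under-test:8080/predict"   # side-car / ambassador URL

BUSINESS_CASES = [
    # (payload, expected_label) pairs curated by the domain SME / QA roles
    ({"leaf_colour": "yellow", "humidity": 0.9}, "fungal_risk"),
    ({"leaf_colour": "green", "humidity": 0.4}, "healthy"),
]

def test_business_critical_cases():
    correct = 0
    for payload, expected in BUSINESS_CASES:
        prediction = requests.post(MODEL_ENDPOINT, json=payload, timeout=10).json()
        correct += prediction.get("label") == expected
    # Illustrative business rule: critical edge cases must be at least 90% correct.
    assert correct / len(BUSINESS_CASES) >= 0.9
```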
Phase III: model trainer setup
Now, that's a bit wild. But fortunately the trainer setup is intuitively identical to the setup of the container builder -- only instead of a model-binary version we have a corpus version (whatever that is; please imagine a git hash for now), coupled with a dataset definition. We do have to touch something we were only thinking about abstractly before.
If we use cloud blob storage for our training data, it means that one team member can change files while another team member is doing their work -- resulting in changed values and all sorts of confusion and wasted time. One neat way to solve that problem is to use DVC -- a tool that lets you use blob storage in a Git-fashioned way. That solves the work-parallelism problem, as people can reference the hashes of their branches to avoid conflicts with each other. But funny things can still happen if we do not specify the exact data files we use for training -- for example, person A deletes most of the files in a folder and it gets merged into master without person B noticing.
This is where the dataset definition file comes in. It can be seen as a filter applied to a particular version of the corpus data, ensuring that the data which passes is clean to use. For computer vision problems it means we can exclude an image from the training set without deleting it from the corpus -- while maintaining visibility of exactly which files were used to train the model. If your training data lives in a data warehouse, and assuming immutability of the data is guaranteed, the dataset definition can be a query run against that warehouse.
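A sketch of resolving such a dataset definition against a DVC-versioned corpus is shown below, assuming the dvc Python API; the definition file format (JSON with a corpus revision and an explicit include list) is illustrative.

```python
# Resolve a dataset definition against a pinned DVC corpus revision (sketch).
import json
import dvc.api

with open("dataset.def.json") as f:
    definition = json.load(f)
# e.g. {"corpus_rev": "a1b2c3d", "include": ["images/plant_001.png", "images/plant_002.png"]}

corpus_rev = definition["corpus_rev"]    # git/DVC revision of the corpus
included_files = definition["include"]   # the exact files the model is trained on

# Stream each included file from blob storage at the pinned revision.
for path in included_files:
    with dvc.api.open(path, repo=".", rev=corpus_rev, mode="rb") as blob:
        data = blob.read()
        # ...feed `data` into the training pipeline; excluded files are simply
        # never read, so the trained model's inputs stay fully traceable.
```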
As you hopefully noticed from the diagram above, the model evaluation is built around business analysis dashboard tooling, and each trainer run exports all of the data used for training/testing into some sort of data lake (AWS-based in this example). As mentioned above, for most models the evaluation figures alone are not sufficient to assess business value -- the lower-accuracy cases might be the most business-critical, or vice versa. This leads to the need for sophisticated tooling that assists this analysis and evolves together with it.
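The trainer-side export could be a thin shared library, roughly like the sketch below, assuming pandas and boto3; the bucket name, key layout and column list are illustrative, and a data engineer would normally own this code.

```python
# Export per-example evaluation results to a data lake for BI drill-down (sketch).
import boto3
import pandas as pd

def export_evaluation(results: pd.DataFrame, run_id: str, bucket: str = "ml-eval-lake") -> None:
    """Upload per-example predictions so analysts can drill down in dashboards."""
    # `results` is expected to have one row per test example, e.g. columns
    # ["example_id", "label", "prediction", "confidence", "segment"].
    local_file = f"/tmp/eval_{run_id}.parquet"
    results.to_parquet(local_file, index=False)
    boto3.client("s3").upload_file(local_file, bucket, f"evaluations/run={run_id}/eval.parquet")
```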
Most processes are multi-stage. In the Natural Language Processing example, there are models splitting text into sentences, then tokens, then identifying concepts, then normalising and linking them (and many more!). What this means in practice is that there is a need for "integration evaluation", comparing end-to-end human vs machine performance. Here it is referred to as the One Metric Error Analysis, and it is also a very natural KPI and SLI for the team and business to monitor, and an SLA for clients to expect.
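As a small illustration, an end-to-end "one metric" over a chain could be computed as in the sketch below; the stage functions and gold corpus are hypothetical.

```python
# End-to-end evaluation of a model chain against human annotations (sketch).
from typing import Callable

def run_chain(text: str, stages: list[Callable]) -> object:
    """Feed the raw input through every stage of the chain, as production would."""
    result = text
    for stage in stages:
        result = stage(result)
    return result

def one_metric(gold_pairs: list[tuple[str, object]], stages: list[Callable]) -> float:
    """Share of documents where the full chain matches the human annotation."""
    hits = sum(run_chain(text, stages) == expected for text, expected in gold_pairs)
    return hits / len(gold_pairs)

# Usage with hypothetical stages:
# score = one_metric(gold_corpus, [split_sentences, tokenize, link_concepts])
```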
If we look at this stage of model development from a bird's-eye view again, it makes sense: the data analyst role can build and maintain the dashboards; the data engineer role can create a library for data export to the data storage of your choice; the DevOps role can include this library in the CI job template and set up CI job nodes with GPU/TPU spot instances (or none), depending on your budget and policies; the data scientist needs to clone the existing Git repository template and write the trainer code. Running the CI job trains the model (or competing models) with a clear, immutable dependency on the dataset and stores it in the model registry. Model evaluation can then be performed in detail on the dashboards, with the data scientist as their user. So far so good.
Ambassador-based model platform summary
We started with an overview of MLOps complexity, the high failure rate of AI projects, and the remarkable distance between POC and MVP for AI projects compared to software products. We hinted at the gap between what vendors promise and what you, as a unique AI product, have to be able to solve by yourself. No vendor can evaluate for you how valuable AI is for your business: mechanical model evaluation tooling and embedded accuracy graphs almost never translate directly into what exactly is valuable in your business case. This fact leads to the necessity of BI dashboards for drill-down model analysis, telling the team what is actually valuable in model performance -- which in turn leads to the need to onboard a data analyst and a data engineer into the team.
Some tools, like Kubeflow, offer solutions for model training/deployment -- but those solutions run somewhat parallel to very similar processes for the software, require additional adapter layers and make the DevOps life more complicated. Data export (required for model decay monitoring) is not straightforward either, as it has to be done from the container with the model itself. Application-level security becomes associated with the data science role, while they are (1) already under a big cognitive load and (2) not trained in the security perspective in the first place.
The ambassador-based approach to hosting models allows iterative scaling of your processes as you go through model development. It keeps everything in sync by using the same, as-simple-as-it-gets tooling (Git, CI pipelines, GitOps deployment) across all roles -- allowing everybody to help each other and reducing the number of bottlenecks. Edge-case failures are debugged in the same way -- be it a failure of the container-build job, the trainer job or the deployment job. We argue that the resulting reduction of the team's cognitive load is the biggest benefit: it allows people to focus on the problem, while keeping professional self-sufficiency and very low deployment times.