Topple old mindsets and thrive in AI projects: P.II


In Part I of this series we talked about the higher-level aspects of software work compared to data science work. We noted the hidden "indoors-nature" assumptions plaguing software-driven mindsets: (1) that the problem is solvable [the problem as a limited indoors space], (2) that if we have a POC we can soon have an MVP, (3) that team size/composition has a predictable influence on delivery [because the problem is finite, scaling the workforce brings predictable results]. And we highlighted how this leads to a wreck, because the ML/AI problems worked on today are typically of an "outdoors-nature": (1) it is not known whether the problem is solvable with acceptable accuracy values / the data available, (2) a POC is very easy to create, but an enterprise-grade MVP needs very sophisticated machinery, (3) the team's cognitive overload, to the degree of stagnating any progress, is much more likely than in software projects.

Today we are going to talk about the commonalities and differences in the day-to-day work of a data scientist compared to that of a software developer. I hope it will increase awareness of how much room most organisations have to improve the comfort and efficiency of data scientists' work -- and, as a result, the velocity of improvement and the quality of ML/AI products.

Serving & Deployment

Not sure if there is anything new to say about the comprehensively explored topic of serving and deploying software code. For our context, what matters most is: (1) source code is packed into some deployable package, (2) there are dozens of ways to run that code in the cloud or on-premises depending on project considerations, (3) there are various deployment tools and patterns that take some sort of package/container and deploy it to the serving point.

Source:  ml-ops.org

In somewhat simplified terms, ML/AI projects start with a subtle difference: there is a model binary which is executed by boilerplate code. That binary can be imagined as an auto-generated program executable by specific software libraries. If it helps, imagine that you need to serve a piece of software which executes another piece of software -- that is the analogy.

We say that models are retrained dynamically if the model binary evolves online without requiring the team's input; otherwise we talk about static learning. Static learning can simply be seen as a resource deployed together with the code, while dynamic learning is either a higher-level automation of static learning or a very different paradigm altogether.

Let's limit our focus to static learning, meaning problems/solutions for which the learning does not happen automatically in production. For this type of learning, we assume that model binaries are created by the team with the help of semi-automated tooling. Since the model binary is executed by software and is static -- the software can load it as a resource (with a small caveat that we want this resource to be version controlled).
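To make the "model as a resource" idea concrete, here is a minimal sketch of boilerplate code serving a static model binary. It assumes a scikit-learn model serialised with joblib and a tiny Flask service around it; the file path, endpoint and field names are hypothetical.

```python
# Minimal sketch: boilerplate code serving a static model binary.
# Assumptions: the binary is a scikit-learn model serialised with joblib,
# and MODEL_PATH points at a version-controlled resource (the path is hypothetical).
from pathlib import Path

import joblib
from flask import Flask, jsonify, request

MODEL_PATH = Path("resources/models/intent_classifier-1.3.0.joblib")  # hypothetical versioned resource

app = Flask(__name__)
model = joblib.load(MODEL_PATH)  # the "auto-generated program" loaded by its runtime library


@app.route("/predict", methods=["POST"])
def predict():
    # The boilerplate only handles I/O; the behaviour itself lives in the loaded binary.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": str(prediction), "model_version": MODEL_PATH.stem})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Nothing here is specific to ML operationally: it is a regular web service that happens to load one more versioned resource at startup.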

But things don't end there. Software is typically data-format aware and data agnostic, meaning that software defines a tunnel of sorts for the data (the tunnel being both the means and the restrictions for the data). Models, however, are data sensitive. For software it is good practice not to log data; for models, data is essential. Further still, with software we are used to layered testing; if we apply the same approach to models, we put a hidden assumption on data scientists to know DevOps/testing -- something they are typically not good at and something that increases their cognitive load (in other words, reduces the productivity of data science itself). The exact same consideration applies to deployment tools and patterns. The deployment of models ranges from solutions like Kubeflow/SageMaker [duplicating existing software processes, but in super-customised tooling] to cross-team/cross-skill dependencies ["passing the model on a USB drive" to a dedicated developer to deploy it].

So where does all of this lead us?

First of all, there is an extra intermediate artifact present in ML/AI projects. This artifact requires certain changes in the CD process. Data scientists are skilled at building that artifact, but there are limits to how much they can do on the CD side of things. It follows that new ways of team interaction need to be introduced. Depending on the position of the slider in a particular team, those interactions can range from the extreme of "everybody in servitude of data science" to the extreme of "data science should manage themselves". Both of those extremes are damaging for the organisation's health and efficiency, so they need to be watched out for.

Secondly, most of the things we know about serving software are transferable to serving models: we need infrastructure monitoring and application-level logs/metrics; we can play with cloud functions serving models, or we can put models into containers; we can use Spark or similar tooling for some of the Big-Data-scale applications. However, we cannot assume that the data science role can be efficient in its primary work and also be good at understanding that level of monitoring data. Most developers can handle the increase in cognitive load when asked to watch the infrastructure level as well -- primarily because parts of that knowledge are already taught at universities or encountered in some form at work. For data scientists, even application-level logging can be complex to follow -- due to all of the cognitive details they keep in mind for the ML/AI needs themselves. This leads to a need for the team to distribute monitoring responsibilities in order to stay on top of things. Importantly, models require data-level monitoring -- to detect things ranging from the natural data drift occurring over time to sophisticated data poisoning attacks by adversaries. The latter more often than not proves to be a big enterprise on its own, as to set things up we need to track which model version (trained on which data) encounters which kinds of data in production. The sensitivity of production data does not make things simpler. However, a lack of such monitoring for ML/AI is similar to skipping the logging/alerting subsystems in software projects. Hopefully nobody would seriously consider doing such a foolish thing, right?.. right?
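To make the data-level monitoring point a bit more tangible, here is a minimal sketch of a drift check on a single numeric feature. It assumes we can sample both the training data of the deployed model version and recent production inputs; the names and the alerting threshold are hypothetical, and a real setup would track this per model version and per feature.

```python
# Minimal sketch of data-level monitoring: compare the distribution of one feature
# in recent production traffic against the training data of the deployed model version.
# The threshold and the synthetic data below are hypothetical stand-ins.
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # hypothetical alerting threshold


def feature_has_drifted(train_values: np.ndarray, production_values: np.ndarray) -> bool:
    """Two-sample Kolmogorov-Smirnov test: True if production drifted away from training."""
    result = ks_2samp(train_values, production_values)
    return result.pvalue < DRIFT_P_VALUE


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    train = rng.normal(loc=0.0, scale=1.0, size=5_000)       # stands in for training data
    production = rng.normal(loc=0.4, scale=1.0, size=5_000)  # stands in for shifted live traffic
    if feature_has_drifted(train, production):
        print("Data drift detected: alert the team and consider retraining.")
```

The statistics are the easy part; the real work is the plumbing that ties "which model version, trained on which data, saw which production data" together.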

Training & Packaging

At a high level, model training and packaging is pretty similar to CI/CD processes for software: (1) we update dependencies [libraries for software; training dataset + libraries for ML/AI]; (2) we commit and push our changes so that the pipeline builds and tests an artifact for us [code binaries/packages for software; model binaries for ML/AI]; (3) we integrate the change into an executable form and run some final tests [containers for both]. When something flies like a duck, swims like a duck and makes a sound like a duck -- it is a duck. So this higher-level perspective tells us that there is nothing extremely special about ML/AI; it is just a subtly different CI/CD process. It is not something that requires you or your organisation to use tools you don't know yet -- or to pay licenses for CI/CD pipelines packaged as ML/AI candy.


The difference is that ML/AI has one more step compared to software. If it helps, let's call it "continuous training" (CT) and portray things as a CT/CI/CD chain:

  1. CT is a pipeline that trains a model, or multiple competing models, and stores the binaries in a model registry; see the sketch after this list. The process executes model evaluation and stores the metrics in the model registry. For most industries, the underlying repository should track, for accountability reasons:
    1. Data (corpus) version
    2. Dataset composition
    3. Hyperparameters used for training
    4. Code used for training
  2. CI is a pipeline that takes a model binary from the registry by its version and packs it, together with the boilerplate code, into a container. All of the standard testing procedures apply.
  3. CD deploys the resulting package/container and is practically identical to the same process for software.
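As a sketch of the CT step under some assumptions -- scikit-learn for training and a plain versioned directory standing in for the model registry (in practice it could be MLflow, SageMaker Model Registry or similar); the dataset, version tags and metric names are hypothetical:

```python
# Minimal sketch of a CT pipeline step: train a model, evaluate it,
# and store the binary plus the accountability metadata in a "model registry".
# Here the registry is just a versioned directory; all names/paths/tags are hypothetical.
import json
from datetime import datetime, timezone
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

REGISTRY_ROOT = Path("model-registry")   # hypothetical registry location
DATA_VERSION = "corpus-2024-05-01"       # hypothetical data/corpus version tag
CODE_VERSION = "a1b2c3d"                 # hypothetical git commit of the training code
HYPERPARAMS = {"n_estimators": 200, "max_depth": 8}


def continuous_training_run() -> Path:
    X, y = load_iris(return_X_y=True)    # stands in for the real, versioned dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = RandomForestClassifier(**HYPERPARAMS, random_state=0).fit(X_train, y_train)
    predictions = model.predict(X_test)

    run_dir = REGISTRY_ROOT / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, run_dir / "model.joblib")  # the model binary for the CI step
    (run_dir / "metadata.json").write_text(json.dumps({
        "data_version": DATA_VERSION,
        "dataset_composition": {"train": len(X_train), "test": len(X_test)},
        "hyperparameters": HYPERPARAMS,
        "code_version": CODE_VERSION,
        "metrics": {
            "accuracy": accuracy_score(y_test, predictions),
            "f1_macro": f1_score(y_test, predictions, average="macro"),
        },
    }, indent=2))
    return run_dir
```

The CI step then only needs a model version to pick up the binary and the boilerplate code, and the CD step does not need to know it is deploying a model at all.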

The catch here is that unlike in software, model training/testing involves randomisation -- so for the exact same setup we are going to get subtly different results on each pipeline run. Even more importantly, most mature enterprises would like to train multiple competing models on the same data. As frameworks differ and data evolves all the time, we want to deploy the best solution we can get. This implies that unlike in software, there is a manual intermediate step of choosing a model. Ideally, there is dedicated data analysis tooling which automatically shows an X-ray of the freshly trained model's performance on the test set and highlights the performance on edge cases important for your business. Once a model is selected (or, if it is a single model, passes the review), the more familiar CI process can be initiated.
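A minimal sketch of that intermediate selection step, assuming each candidate's metrics were already stored by a CT pipeline like the one sketched above (the registry layout and metric names are hypothetical, and the final call stays with a human):

```python
# Minimal sketch of the model selection step between CT and CI:
# read the metrics the CT pipeline stored for each candidate, surface them
# for review and propose the best one. Registry layout and metric names are hypothetical.
import json
from pathlib import Path

REGISTRY_ROOT = Path("model-registry")  # matches the hypothetical CT sketch above


def propose_best_candidate(metric: str = "f1_macro"):
    """List candidate models for human review and return the best run directory."""
    candidates = []
    for run_dir in sorted(REGISTRY_ROOT.iterdir()):
        metadata = json.loads((run_dir / "metadata.json").read_text())
        candidates.append((metadata["metrics"][metric], run_dir, metadata))

    for score, run_dir, metadata in sorted(candidates, key=lambda c: c[0], reverse=True):
        # In a real setup this is a dashboard with per-slice / edge-case breakdowns,
        # not a print; the final choice stays with a human reviewer.
        print(f"{run_dir.name}: {metric}={score:.4f}, data={metadata['data_version']}")

    return max(candidates, key=lambda c: c[0])[1] if candidates else None
```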

Let's summarise what we talked about here.

Processes and tools created for the automation of software building are, in practice, transferable to ML/AI. There is a new step in those processes and a need for a new type of registry (the model registry), but it is all in line with the existing ways/intuitions of pipelines that build, test and package software. I'm stressing this point because, for some reason, on many occasions I have seen it treated by teams as some sort of black magic. I won't be surprised to see major players like Bitbucket, GitLab or GitHub introduce model registries for private use -- just like the functionality to store software library artifacts. You can judge my words for yourself by comparing the Kubeflow pipelines graph to the processes we talked about above:

Kubeflow pipelines: an example of CI/CD tooling duplication under the guise of making things simpler

What is different from software is that we need an extra pipeline (one extra repository) for model training, and this pipeline typically requires a specialised non-CPU (e.g. GPU) instance. There is nothing new about needing a specialised instance for software building (nodding to you, cross-platform developers), so nothing scary here. What is different is the need for a model registry and for statistical evaluation of models instead of green/red tests. There is also a need to set up processes in a way that does not require data scientists to fix things they don't have the skills to fix (flaky integration tests, code failures because of different model outputs..), but each team finds its way through this aspect eventually anyway.

Note: I'm not against Kubeflow, and everything has its application in the right context. For the most part, I consider how a team works with things in practice. For big organisations, tools like Kubeflow allow the creation of isolated data science teams. The cost of this is the maintenance, security and processes around a separate piece of tooling, plus the hidden costs following from the way teams interact. It is one of the feasible ways to control the cognitive load of data scientists, but it is a way that discounts the existing expertise of the software and platform developers in the organisation.

Data cleaning & versioning

It's been said that most of a data scientist's time is spent on data cleaning/wrangling. For people with a software background it might be useful to remember that the training data of an ML/AI project is pretty much its source code. One does not simply "put in the training data and get the model". The dataset is essentially an exit poll: a sample of an infinite world based on which the model makes conclusions about the real world.

Data cleaning: a not-that-generic process of creating and maintaining a data corpus.

If data is like source code, it sounds like a brilliant idea to treat it as such. There are nice tools like DVC, allowing you to store the actual (typically big) data in cloud storage buckets while keeping all the comforts of git ways of working. DVC provides data versioning and tracking and allows conflict detection when people modify things in parallel -- but there is still a need for an environment (an "IDE") for data scientists to clean the data in. I haven't encountered a more general solution for this yet; usually teams develop some custom tooling/processes for their problem and use DVC commits to integrate data iterations with model development.
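For illustration, a minimal sketch of reading a specific, versioned revision of a DVC-tracked file from Python via dvc.api; the repository URL, file path and revision tag are hypothetical.

```python
# Minimal sketch: pull a specific, versioned revision of a DVC-tracked dataset
# straight from Python. Repository URL, path and revision tag are hypothetical.
import json

import dvc.api

with dvc.api.open(
    "data/annotations.jsonl",                    # path inside the repo, tracked by DVC
    repo="https://github.com/acme/nlp-corpus",   # hypothetical git repo with DVC metadata
    rev="corpus-2024-05-01",                     # git tag/commit, i.e. the data version
) as f:
    for line in f:
        record = json.loads(line)  # feed the versioned data into cleaning/training tooling
```

This is the same mental model as installing a pinned library version: the data version becomes an explicit, reviewable dependency of the training run.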

Importantly, as there is a git repository involved in data versioning, the team is free to -- and should -- create a set of tests for their data. Often those tests are simple python scripts checking the data for integrity: for NLP problems they can check the correctness of annotations; for vision/sound problems there might be some spectrum checks; if train/test sets are defined manually, you might want to ensure that the same files are not used for both training and testing. In practice, the opportunity and scope of useful data tests is huge, and this is where QA specialists' and software developers' hunches prove to be very influential.
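For illustration, a minimal sketch of such data tests, assuming manually defined train/test splits stored as JSONL annotation files; the paths, schema and label set are hypothetical.

```python
# Minimal sketch of data tests, runnable with pytest.
# Assumes a versioned corpus with manually defined splits and a simple
# JSONL annotation schema; all paths, fields and labels are hypothetical.
import json
from pathlib import Path

DATA_DIR = Path("data")
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def load_split(name: str) -> list[dict]:
    with (DATA_DIR / f"{name}.jsonl").open() as f:
        return [json.loads(line) for line in f]


def test_annotations_are_valid():
    # NLP-style integrity check: every record has text and a known label.
    for record in load_split("train") + load_split("test"):
        assert record["text"].strip(), "empty text in corpus"
        assert record["label"] in ALLOWED_LABELS, f"unknown label: {record['label']}"


def test_no_leakage_between_train_and_test():
    # Manually defined splits: make sure no sample is used for both training and testing.
    train_ids = {record["id"] for record in load_split("train")}
    test_ids = {record["id"] for record in load_split("test")}
    assert not train_ids & test_ids, "same samples used for training and testing"
```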

Afterword

While a lot of attention and talk in IT is dedicated to technology and the tech stack, I have found that the human interactions in the team, taking cognitive load into account, are critical to a project's success. Typically there is a lot of hype and complex terminology when a new technology appears -- and it is this way for the ML/AI field today. However, just like everything before it, it is an evolution rather than a revolution. The mathematical fundamentals behind computer science stay the same; the manufacture-style team processes also stay. Some things change, and in the case of ML/AI those are: (1) clean data becomes the new "code" and needs relevant processes, (2) model training has an extra stage and needs a new type of registry, (3) we cannot assume that one role can handle all of the cross-stack details. As data scientists are soon to be the new normal in IT organisations, the addition of this new role to team interactions is likely to shake existing ways of working quite a bit.

If you enter this new field with the right mindset, you will hopefully find that it makes sense and feels just right.

Wish you all the best on your AI journey!
