
Running and maintaining a machine learning (ML) lifecycle in production is not an easy endeavor. Depending on the scale of the project, the activities involved can be quite challenging, for example:

  1. Setting up the ML project and pipelines for development, training, testing, and deployment, i.e., CI/CD (continuous integration and continuous delivery).
  2. Managing all ML models, including versioning, A/B-tested models, the features used, and the logging of training/testing runs.
  3. Managing features, including the feature-engineering pipeline and versioning.
  4. Building dedicated pipelines for model experimentation, for instance, A/B testing ML models.
  5. Managing experiments, including the measurements for each experiment.
  6. Maintaining the whole pipeline so it keeps working as intended.
  7. Running and maintaining the operational aspects, e.g., currently running models, currently running experiments, metrics, logging, and infrastructure.

I have collaborated with data scientists on delivering an ML project. We could leverage the Google Cloud Platform (GCP) tech stack, yet the work was still very involved. Although GCP has a broad set of tools, there was no such thing as automated ML lifecycle management. My team and I had to lay out the foundation and decide on each step. For instance, we had to choose the tech stack while staying compliant with the GDPR (General Data Protection Regulation). This matters because the infrastructure influences the choice of underlying technologies and frameworks.

Overall, it was tedious but enjoyable since we learned a lot. However, it was not time-to-market friendly. For instance, we spent at least two months setting up the ML project and pipeline, from feature extraction and engineering to training, testing, deployment, and versioning, not including running it in production. Moreover, we had to build the pipeline to support experimentation, which was also non-trivial.

Applying DevOps principles to manage the ML lifecycle has a name: MLOps. There are MLOps tools with varying scopes and focuses, Kubeflow, MLflow, and Pachyderm to name a few. Kubeflow is an ML workflow orchestration tool on top of Kubernetes, comparable to more generic job orchestrators such as Airflow. MLflow, on the other hand, is a package that focuses on model management: tracking training runs, experiments, model versions, and so on.
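To give a rough feel for what MLflow's tracking API covers, here is a minimal sketch; the experiment name, parameters, and metric values are illustrative placeholders, not from an actual project:

```python
import mlflow

# Minimal sketch of MLflow experiment tracking.
# The experiment name, parameters, and metric below are placeholders.
mlflow.set_experiment("demo-experiment")

with mlflow.start_run(run_name="baseline"):
    # Log hyperparameters of this training run.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # ... train and evaluate the model here ...

    # Log evaluation metrics so runs can be compared in the MLflow UI.
    mlflow.log_metric("auc", 0.87)
```

Each run's parameters and metrics end up in the tracking store, which is what makes comparing experiments and managing model versions much less manual.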

Following a Google MLOps workshop, I learned that Google's opinionated approach to MLOps relies on Kubeflow. The idea is to provide an orchestration mechanism, via Kubernetes, to access and execute different types of jobs on GCP stacks such as Google Dataflow, Google Compute Engine, Google Kubernetes Engine, and Google AI Platform. It also tracks model training (and the feature store), versioning, and deployment. There are runnable examples available on GitHub. Although I couldn't see immediate benefits compared to using Airflow for orchestration (supplemented with MLflow), the Google-managed automation and model management are indeed useful if we commit to the GCP stack.
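For a sense of what that orchestration layer looks like in code, below is a minimal sketch using the Kubeflow Pipelines SDK (kfp v2). The component and pipeline names are hypothetical, and the step bodies are placeholders rather than real GCP jobs:

```python
from kfp import dsl, compiler


@dsl.component
def extract_features(source_table: str) -> str:
    # Placeholder: a real pipeline would launch, e.g., a Dataflow/BigQuery job here.
    return f"features extracted from {source_table}"


@dsl.component
def train_model(features: str, learning_rate: float) -> str:
    # Placeholder: a real pipeline would submit a training job here.
    return f"model trained on '{features}' with lr={learning_rate}"


@dsl.pipeline(name="demo-ml-pipeline")
def demo_ml_pipeline(source_table: str = "demo.dataset", learning_rate: float = 0.01):
    # Wire the steps together; each call becomes a containerized task.
    features_task = extract_features(source_table=source_table)
    train_model(features=features_task.output, learning_rate=learning_rate)


# Compile to a pipeline spec that can be submitted to a Kubeflow Pipelines backend.
compiler.Compiler().compile(
    pipeline_func=demo_ml_pipeline,
    package_path="demo_ml_pipeline.yaml",
)
```

The compiled spec is what the Kubeflow backend schedules on Kubernetes, which is where the comparison with a generic orchestrator like Airflow comes from.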

Have fun coding!
