{"id":557,"date":"2020-08-21T13:40:39","date_gmt":"2020-08-21T13:40:39","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/21\/data-science-meets-devops-mlops-with-jupyter-git-kubernetes\/"},"modified":"2020-08-21T13:40:39","modified_gmt":"2020-08-21T13:40:39","slug":"data-science-meets-devops-mlops-with-jupyter-git-kubernetes","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/21\/data-science-meets-devops-mlops-with-jupyter-git-kubernetes\/","title":{"rendered":"Data Science Meets Devops: MLOps with Jupyter, Git, &amp; Kubernetes"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/jeremy-lewi-600aaa8\/\" target=\"_blank\" rel=\"noopener noreferrer\">Jeremy Lewi<\/a>, Software Engineer at Google &amp; <a href=\"https:\/\/hamel.dev\/\" target=\"_blank\" rel=\"noopener noreferrer\">Hamel Husain<\/a>, <\/b><\/p>\n<p>\u00a0<\/p>\n<h3>The Problem<\/h3>\n<p>\u00a0<br \/><a href=\"https:\/\/www.kubeflow.org\/\" rel=\"noopener noreferrer\" target=\"_blank\">Kubeflow<\/a>\u00a0is a fast-growing open source project that makes it easy to deploy and manage machine learning on Kubernetes.<\/p>\n<p>Due to Kubeflow\u2019s explosive popularity, we receive a large influx of GitHub issues that must be triaged and routed to the appropriate subject matter expert. The below chart illustrates the number of new issues opened for the past year:<\/p>\n<div>\n<img src=\"https:\/\/blog.kubeflow.org\/images\/2020-08-01-data-science-meets-devops\/fig1.num-issues.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><strong>Figure 1:<\/strong>\u00a0Number of Kubeflow Issues<\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>To keep up with this influx, we started investing in a Github App called\u00a0<a href=\"https:\/\/github.com\/marketplace\/issue-label-bot\" rel=\"noopener noreferrer\" target=\"_blank\">Issue Label Bot<\/a>\u00a0that used machine learning to auto label issues. 
Our\u00a0<a href=\"https:\/\/github.com\/marketplace\/issue-label-bot\" rel=\"noopener noreferrer\" target=\"_blank\">first model<\/a>\u00a0was trained using a collection of popular public repositories on GitHub and only predicted generic labels. Subsequently, we started using\u00a0<a href=\"https:\/\/cloud.google.com\/automl\/docs\" rel=\"noopener noreferrer\" target=\"_blank\">Google AutoML<\/a>\u00a0to train a Kubeflow specific model. The new model was able to predict Kubeflow specific labels with average precision of 72% and average recall of 50%. This significantly reduced the toil associated with issue management for Kubeflow maintainers. The table below contains evaluation metrics for Kubeflow specific labels on a holdout set. The\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall\" rel=\"noopener noreferrer\" target=\"_blank\">precision and recall<\/a>\u00a0below coincide with prediction thresholds that we calibrated to suit our needs.<\/p>\n<table width=\"70%\" border=\"1\" cellspacing=\"2\" cellpadding=\"3\" 
class=\"wc\">\n<thead>\n<tr>\n<th>Label<\/th>\n<th>Precision<\/th>\n<th>Recall<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>area-backend<\/td>\n<td>0.6<\/td>\n<td>0.4<\/td>\n<\/tr>\n<tr>\n<td>area-bootstrap<\/td>\n<td>0.3<\/td>\n<td>0.1<\/td>\n<\/tr>\n<tr>\n<td>area-centraldashboard<\/td>\n<td>0.6<\/td>\n<td>0.6<\/td>\n<\/tr>\n<tr>\n<td>area-components<\/td>\n<td>0.5<\/td>\n<td>0.3<\/td>\n<\/tr>\n<tr>\n<td>area-docs<\/td>\n<td>0.8<\/td>\n<td>0.7<\/td>\n<\/tr>\n<tr>\n<td>area-engprod<\/td>\n<td>0.8<\/td>\n<td>0.5<\/td>\n<\/tr>\n<tr>\n<td>area-front-end<\/td>\n<td>0.7<\/td>\n<td>0.5<\/td>\n<\/tr>\n<tr>\n<td>area-frontend<\/td>\n<td>0.7<\/td>\n<td>0.4<\/td>\n<\/tr>\n<tr>\n<td>area-inference<\/td>\n<td>0.9<\/td>\n<td>0.5<\/td>\n<\/tr>\n<tr>\n<td>area-jupyter<\/td>\n<td>0.9<\/td>\n<td>0.7<\/td>\n<\/tr>\n<tr>\n<td>area-katib<\/td>\n<td>0.8<\/td>\n<td>1.0<\/td>\n<\/tr>\n<tr>\n<td>area-kfctl<\/td>\n<td>0.8<\/td>\n<td>0.7<\/td>\n<\/tr>\n<tr>\n<td>area-kustomize<\/td>\n<td>0.3<\/td>\n<td>0.1<\/td>\n<\/tr>\n<tr>\n<td>area-operator<\/td>\n<td>0.8<\/td>\n<td>0.7<\/td>\n<\/tr>\n<tr>\n<td>area-pipelines<\/td>\n<td>0.7<\/td>\n<td>0.4<\/td>\n<\/tr>\n<tr>\n<td>area-samples<\/td>\n<td>0.5<\/td>\n<td>0.5<\/td>\n<\/tr>\n<tr>\n<td>area-sdk<\/td>\n<td>0.7<\/td>\n<td>0.4<\/td>\n<\/tr>\n<tr>\n<td>area-sdk-dsl<\/td>\n<td>0.6<\/td>\n<td>0.4<\/td>\n<\/tr>\n<tr>\n<td>area-sdk-dsl-compiler<\/td>\n<td>0.6<\/td>\n<td>0.4<\/td>\n<\/tr>\n<tr>\n<td>area-testing<\/td>\n<td>0.7<\/td>\n<td>0.7<\/td>\n<\/tr>\n<tr>\n<td>area-tfjob<\/td>\n<td>0.4<\/td>\n<td>0.4<\/td>\n<\/tr>\n<tr>\n<td>platform-aws<\/td>\n<td>0.8<\/td>\n<td>0.5<\/td>\n<\/tr>\n<tr>\n<td>platform-gcp<\/td>\n<td>0.8<\/td>\n<td>0.6<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong><span>Table 1:<\/span><\/strong>\u00a0Evaluation metrics for various Kubeflow labels.<\/p>\n<p>\n\u00a0<\/p>\n<p>Given the rate at which new issues are arriving, retraining our model periodically became a priority. 
We believe continuously retraining and deploying our model to leverage this new data is critical to maintaining the efficacy of our models.<\/p>\n<p>\u00a0<\/p>\n<h3>Our Solution<\/h3>\n<p>\u00a0<br \/>Our CI\/CD solution is illustrated in\u00a0<a href=\"https:\/\/blog.kubeflow.org\/mlops\/#fig2\" rel=\"noopener noreferrer\" target=\"_blank\">Figure 2<\/a>. We don\u2019t explicitly create a directed acyclic graph (DAG) to connect the steps in an ML workflow (e.g. preprocessing, training, validation, deployment, etc\u2026). Rather, we use a set of independent controllers. Each controller declaratively describes the desired state of the world and takes actions necessary to make the actual state of the world match. This independence makes it easy for us to use whatever tools make the most sense for each step. More specifically we use<\/p>\n<ul>\n<li>Jupyter notebooks for developing models.\n<\/li>\n<li>GitOps for continuous integration and deployment.\n<\/li>\n<li>Kubernetes and managed cloud services for underlying infrastructure.\n<\/li>\n<\/ul>\n<div>\n<img src=\"https:\/\/blog.kubeflow.org\/images\/2020-08-01-data-science-meets-devops\/fig2.ci-cd.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><strong>Figure 2:<\/strong>\u00a0illustrates how we do CI\/CD. Our pipeline today consists of two independently operating controllers. We configure the Trainer (left hand side) by describing what models we want to exist; i.e. what it means for our models to be \u201cfresh\u201d. The Trainer periodically checks whether the set of trained models are sufficiently fresh and if not trains a new model. We likewise configure the Deployer (right hand side) to define what it means for the deployed model to be in sync with the set of trained models. 
If the correct model is not deployed, it will deploy a new model.<\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>For more details on model training and deployment, refer to the\u00a0<a href=\"https:\/\/blog.kubeflow.org\/mlops\/#actuation\" rel=\"noopener noreferrer\" target=\"_blank\">Actuation section below<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>Background<\/h3>\n<p>\u00a0<\/p>\n<h3>Building Resilient Systems With Reconcilers<\/h3>\n<p>\u00a0<br \/>A reconciler is a control pattern that has proven to be immensely useful for building resilient systems. The reconcile pattern is\u00a0<a href=\"https:\/\/book.kubebuilder.io\/cronjob-tutorial\/controller-overview.html\" rel=\"noopener noreferrer\" target=\"_blank\">at the heart of how Kubernetes works<\/a>. Figure 3 illustrates how a reconciler works. A reconciler first observes the state of the world; e.g. what model is currently deployed. It then compares this against the desired state of the world and computes the diff; e.g. the model with label \u201cversion=20200724\u201d should be deployed, but the model currently deployed has label \u201cversion=20200700\u201d. Finally, the reconciler takes the action necessary to drive the world to the desired state; e.g. 
open a pull request to change the deployed model.<\/p>\n<div>\n<img src=\"https:\/\/blog.kubeflow.org\/images\/2020-08-01-data-science-meets-devops\/fig3.reconciler.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><strong>Figure 3.<\/strong>\u00a0Illustration of the reconciler pattern as applied by our deployer.<\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Reconcilers have proven immensely useful for building resilient systems because a well implemented reconciler provides a high degree of confidence that no matter how a system is perturbed it will eventually return to the desired state.<\/p>\n<p>\u00a0<\/p>\n<h3>There is no DAG<\/h3>\n<p>\u00a0<br \/>The declarative nature of controllers means data can flow through a series of controllers without needing to explicitly create a DAG. In lieu of a DAG, a series of data processing steps can instead be expressed as a set of desired states, as illustrated in Figure 4 below:<\/p>\n<div>\n<img src=\"https:\/\/blog.kubeflow.org\/images\/2020-08-01-data-science-meets-devops\/fig4.data-pipeline.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><strong>Figure 4:<\/strong>\u00a0illustrates how pipelines can emerge from independent controllers without explicitly encoding a DAG. Here we have two completely independent controllers. The first controller ensures that for every element a<sub>i<\/sub>\u00a0there should be an element b<sub>i<\/sub>. The second controller ensures that for every element b<sub>i<\/sub>\u00a0there should be an element c<sub>i<\/sub>.<\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>This reconciler-based paradigm offers the following benefits over many traditional DAG-based workflows:<\/p>\n<ul>\n<li>\n<strong>Resilience against failures<\/strong>: the system continuously seeks to achieve and maintain the desired state.\n<\/li>\n<li>\n<strong>Increased autonomy of engineering teams:<\/strong>\u00a0each team is free to choose the tools and infrastructure that suit their needs. 
The reconciler framework only requires a minimal amount of coupling between controllers while still allowing one to write expressive workflows.\n<\/li>\n<li>\n<strong>Battle-tested patterns and tools<\/strong>: This reconciler-based framework does not invent something new. Kubernetes has a rich ecosystem of tools that aim to make it easy to build controllers. The popularity of Kubernetes means there is a large and growing community familiar with this pattern and its supporting tools.\n<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>GitOps: Operation By Pull Request<\/h3>\n<p>\u00a0<br \/>GitOps (Figure 5) is a pattern for managing infrastructure. The core idea of GitOps is that source control (it doesn\u2019t have to be git) should be the source of truth for the configuration files describing your infrastructure. Controllers can then monitor source control and automatically update your infrastructure as your config changes. This means that to make a change (or undo a change) you just open a pull request.<\/p>\n<div>\n<img src=\"https:\/\/blog.kubeflow.org\/images\/2020-08-01-data-science-meets-devops\/fig5.gitops.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<div class=\"caption\">\n<strong>Figure 5:<\/strong>\u00a0To push a new model for Label Bot, we create a PR updating the config map storing the ID of the AutoML model we want to use. When the PR is merged,\u00a0<br \/>\n<a href=\"https:\/\/cloud.google.com\/anthos-config-management\/docs\" rel=\"noopener noreferrer\" target=\"_blank\">Anthos Config Management (ACM)<\/a> automatically rolls out those changes to our GKE cluster. As a result, subsequent predictions are made using the new model. 
(Image courtesy of\u00a0<br \/>\n<a href=\"https:\/\/www.weave.works\/blog\/automate-kubernetes-with-gitops\" rel=\"noopener noreferrer\" target=\"_blank\">Weaveworks<\/a>)\n<\/div>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<h3>Putting It Together: Reconciler + GitOps = CI\/CD for ML<\/h3>\n<p>\u00a0<br \/>With that background out of the way, let\u2019s dive into how we built CI\/CD for ML by combining the Reconciler and GitOps patterns.<\/p>\n<p>There were three problems we needed to solve:<\/p>\n<ol>\n<li>How do we compute the diff between the desired and actual state of the world?\n<\/li>\n<li>How do we effect the changes needed to make the actual state match the desired state?\n<\/li>\n<li>How do we build a control loop to continuously run 1 &amp; 2?\n<\/li>\n<\/ol>\n<p>\u00a0<\/p>\n<h3>Computing Diffs<\/h3>\n<p>\u00a0<br \/>To compute the diffs, we write lambdas that do exactly what we need. In this case we wrote two lambdas:<\/p>\n<ol>\n<li>The\u00a0<a href=\"https:\/\/github.com\/kubeflow\/code-intelligence\/blob\/faeb65757214ac93259f417b81e9e2fedafaebda\/Label_Microservice\/go\/cmd\/automl\/pkg\/server\/server.go#L109\" rel=\"noopener noreferrer\" target=\"_blank\">first lambda<\/a>\u00a0determines whether we need to retrain based on the age of the most recent model.\n<\/li>\n<li>The\u00a0<a href=\"https:\/\/github.com\/kubeflow\/code-intelligence\/blob\/faeb65757214ac93259f417b81e9e2fedafaebda\/Label_Microservice\/go\/cmd\/automl\/pkg\/server\/server.go#L49\" rel=\"noopener noreferrer\" target=\"_blank\">second lambda<\/a>\u00a0determines whether the model needs to be updated by comparing the most recently trained model to the model listed in a config map checked into source control.\n<\/li>\n<\/ol>\n<p>We wrap these lambdas in a simple web server and deploy it on Kubernetes. 
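To make the first diff concrete: it boils down to comparing the age of the newest trained model against a freshness threshold. The sketch below is a hypothetical Python rendition for illustration only; the actual lambdas are written in Go (linked above), and the function name and the 30-day threshold here are assumptions, not values from the project.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch of the 'needs retraining' diff.
# Desired state: a model trained within the freshness window exists.
# Observed state: the train time of the most recent model.
def needs_sync(latest_model_time, now, max_age=timedelta(days=30)):
    # The diff is simply: is the newest trained model older than allowed?
    return now - latest_model_time > max_age

# Example: under a 30-day threshold, a 45-day-old model is stale
# and a 5-day-old model is fresh.
now = datetime(2020, 8, 1, tzinfo=timezone.utc)
stale = needs_sync(now - timedelta(days=45), now)
fresh = needs_sync(now - timedelta(days=5), now)
print(stale, fresh)  # True False
```

The second lambda's diff is analogous: instead of comparing timestamps, it compares the newest trained model's ID against the ID recorded in the config map in source control.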
One reason we chose this approach is that we wanted to rely on Kubernetes\u2019\u00a0<a href=\"https:\/\/github.com\/kubernetes\/git-sync\" rel=\"noopener noreferrer\" target=\"_blank\">git-sync<\/a>\u00a0to mirror our repository to a pod volume. This makes our lambdas super simple because all the git management is taken care of by a side-car running\u00a0<a href=\"https:\/\/github.com\/kubernetes\/git-sync\" rel=\"noopener noreferrer\" target=\"_blank\">git-sync<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>Actuation<\/h3>\n<p>\u00a0<br \/>To apply the necessary changes, we use Tekton to glue together the various CLIs used to perform each step.<\/p>\n<p>\u00a0<\/p>\n<h3>Model Training<\/h3>\n<p>\u00a0<br \/>To train our model, we have a\u00a0<a href=\"https:\/\/github.com\/kubeflow\/code-intelligence\/blob\/faeb65757214ac93259f417b81e9e2fedafaebda\/tekton\/tasks\/run-notebook-task.yaml#L34\" rel=\"noopener noreferrer\" target=\"_blank\">Tekton task<\/a>\u00a0that:<\/p>\n<ol>\n<li>Runs our notebook using\u00a0<a href=\"https:\/\/github.com\/nteract\/papermill\" rel=\"noopener noreferrer\" target=\"_blank\">papermill<\/a>.\n<\/li>\n<li>Converts the notebook to HTML using\u00a0<a href=\"https:\/\/nbconvert.readthedocs.io\/en\/latest\/\" rel=\"noopener noreferrer\" target=\"_blank\">nbconvert<\/a>.\n<\/li>\n<li>Uploads the\u00a0<code>.ipynb<\/code>\u00a0and\u00a0<code>.html<\/code>\u00a0files to GCS using\u00a0<a href=\"https:\/\/cloud.google.com\/storage\/docs\/gsutil\" rel=\"noopener noreferrer\" target=\"_blank\">gsutil<\/a>.\n<\/li>\n<\/ol>\n<p>This notebook fetches GitHub Issues data\u00a0<a href=\"https:\/\/medium.com\/google-cloud\/analyzing-github-issues-and-comments-with-bigquery-c41410d3308\" rel=\"noopener noreferrer\" target=\"_blank\">from BigQuery<\/a>\u00a0and generates CSV files on GCS suitable for import into\u00a0<a href=\"https:\/\/cloud.google.com\/automl\" rel=\"noopener noreferrer\" target=\"_blank\">Google AutoML<\/a>. 
The notebook then launches an\u00a0<a href=\"https:\/\/cloud.google.com\/automl\" rel=\"noopener noreferrer\" target=\"_blank\">AutoML<\/a>\u00a0job to train a model.<\/p>\n<p>We chose AutoML because we wanted to focus on building a complete end-to-end solution rather than iterating on the model. AutoML provides a competitive baseline that we may try to improve upon in the future.<\/p>\n<p>To make the executed notebook easy to view, we convert it to HTML and upload it to\u00a0<a href=\"https:\/\/cloud.google.com\/storage\/docs\/hosting-static-website\" rel=\"noopener noreferrer\" target=\"_blank\">GCS, which makes it easy to serve public, static content<\/a>. This allows us to use notebooks to generate rich visualizations to evaluate our model.<\/p>\n<p>\u00a0<\/p>\n<h3>Model Deployment<\/h3>\n<p>\u00a0<br \/>To deploy our model, we have a\u00a0<a href=\"https:\/\/github.com\/kubeflow\/code-intelligence\/blob\/faeb65757214ac93259f417b81e9e2fedafaebda\/tekton\/tasks\/update-model-pr-task.yaml#L68\" rel=\"noopener noreferrer\" target=\"_blank\">Tekton task<\/a>\u00a0that:<\/p>\n<ol>\n<li>Uses kpt to update our config map with the desired value.\n<\/li>\n<li>Runs git to push our changes to a branch.\n<\/li>\n<li>Uses a wrapper around the\u00a0<a href=\"https:\/\/github.com\/cli\/cli\" rel=\"noopener noreferrer\" target=\"_blank\">GitHub CLI<\/a>\u00a0(gh) to create a PR.\n<\/li>\n<\/ol>\n<p>The controller ensures there is only one Tekton pipeline running at a time. We configure our pipelines to always push to the same branch. 
This ensures we only ever open one PR to update the model because GitHub doesn\u2019t allow multiple PRs to be created from the same branch.<\/p>\n<p>Once the PR is merged,\u00a0<a href=\"https:\/\/cloud.google.com\/anthos\/config-management\" rel=\"noopener noreferrer\" target=\"_blank\">Anthos Config Management<\/a>\u00a0automatically applies the Kubernetes manifests to our Kubernetes cluster.<\/p>\n<p>\u00a0<\/p>\n<h3>Why Tekton<\/h3>\n<p>\u00a0<br \/>We picked Tekton because the primary challenge we faced was sequentially running a series of CLIs in various containers. Tekton is perfect for this. Importantly, all the steps in a Tekton task run on the same pod, which allows data to be shared between steps using a pod volume.<\/p>\n<p>Furthermore, since Tekton resources are Kubernetes resources, we can adopt the same GitOps pattern and tooling to update our pipeline definitions.<\/p>\n<p>\u00a0<\/p>\n<h3>The Control Loop<\/h3>\n<p>\u00a0<br \/>Finally, we needed to build a control loop that would periodically invoke our lambdas and launch our Tekton pipelines as needed. We used kubebuilder to create a\u00a0<a href=\"https:\/\/github.com\/kubeflow\/code-intelligence\/tree\/master\/Label_Microservice\/go\" rel=\"noopener noreferrer\" target=\"_blank\">simple custom controller<\/a>. Our controller\u2019s reconcile loop calls our lambda to determine whether a sync is needed and, if so, with what parameters. If a sync is needed, the controller fires off a Tekton pipeline to perform the actual update. 
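The shape of that reconcile loop can be sketched as follows. This is a simplified Python rendition of the pattern only; the real controller is written in Go with kubebuilder, and the helper names (check_needs_sync, launch_pipeline, pipeline_running) are hypothetical stand-ins for the HTTP call to the lambda and the Tekton PipelineRun machinery.

```python
# Simplified sketch of the controller's reconcile loop.
# check_needs_sync stands in for the HTTP call to the lambda at
# needsSyncUrl; launch_pipeline stands in for creating a Tekton
# PipelineRun from the template; pipeline_running enforces the
# one-run-at-a-time rule. All three names are hypothetical.
def reconcile(model_sync, check_needs_sync, launch_pipeline, pipeline_running):
    if pipeline_running(model_sync):
        return 'pipeline already running; wait'
    decision = check_needs_sync(model_sync['needsSyncUrl'])
    if not decision['needsSync']:
        return 'in sync; nothing to do'
    # Fire off a Tekton pipeline, substituting the lambda's
    # parameters into the PipelineRun template.
    launch_pipeline(model_sync['pipelineRunTemplate'], decision['parameters'])
    return 'sync started'

# Example run with stubbed dependencies:
result = reconcile(
    {'needsSyncUrl': 'http://labelbot-diff/needsSync', 'pipelineRunTemplate': {}},
    check_needs_sync=lambda url: {'needsSync': True, 'parameters': {'name': 'automl-model'}},
    launch_pipeline=lambda template, params: None,
    pipeline_running=lambda ms: False,
)
print(result)  # sync started
```

Because each call observes the current state from scratch, the loop can be rerun at any time and converges on the desired state no matter how the system was perturbed.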
An example of our\u00a0<a href=\"https:\/\/kubernetes.io\/docs\/concepts\/extend-kubernetes\/api-extension\/custom-resources\/\" rel=\"noopener noreferrer\" target=\"_blank\">custom resource<\/a>\u00a0is illustrated below:<\/p>\n<div>\n<pre><code>apiVersion: automl.cloudai.kubeflow.org\/v1alpha1\r\nkind: ModelSync\r\nmetadata:\r\n  name: modelsync-sample\r\n  namespace: label-bot-prod\r\nspec:\r\n  failedPipelineRunsHistoryLimit: 10\r\n  needsSyncUrl: http:\/\/labelbot-diff.label-bot-prod\/needsSync\r\n  parameters:\r\n  - needsSyncName: name\r\n    pipelineName: automl-model\r\n  pipelineRunTemplate:\r\n    spec:\r\n      params:\r\n      - name: automl-model\r\n        value: notavlidmodel\r\n      - name: branchName\r\n        value: auto-update\r\n      - name: fork\r\n        value: git@github.com:kubeflow\/code-intelligence.git\r\n      - name: forkName\r\n        value: fork\r\n      pipelineRef:\r\n        name: update-model-pr\r\n      resources:\r\n      - name: repo\r\n        resourceSpec:\r\n          params:\r\n          - name: url\r\n            value: https:\/\/github.com\/kubeflow\/code-intelligence.git\r\n          - name: revision\r\n            value: master\r\n          type: git\r\n      serviceAccountName: auto-update\r\n  successfulPipelineRunsHistoryLimit: 10\r\n\r\n<\/code><\/pre>\n<\/div>\n<p>The custom resource specifies the endpoint,\u00a0<strong>needsSyncUrl<\/strong>, for the lambda that computes whether a sync is needed and a Tekton PipelineRun,\u00a0<strong>pipelineRunTemplate<\/strong>, describing the pipeline run to create when a sync is needed. The controller takes care of the details; e.g. 
ensuring only 1 pipeline per resource is running at a time, garbage collecting old runs, etc\u2026 All of the heavy lifting is taken care of for us by Kubernetes and kubebuilder.<\/p>\n<p>Note, for historical reasons the kind,\u00a0<strong>ModelSync<\/strong>, and apiVersion\u00a0<strong>automl.cloudai.kubeflow.org<\/strong>\u00a0are not reflective of what the controller actually does. We plan on fixing this in the future.<\/p>\n<p>\u00a0<\/p>\n<h3>Build Your Own CI\/CD pipelines<\/h3>\n<p>\u00a0<br \/>Our code base is a long way from being polished, easily reusable tooling. Nonetheless it is all public and could be a useful starting point for trying to build your own pipelines.<\/p>\n<p>Here are some pointers to get you started:<\/p>\n<ol>\n<li>Use the Dockerfile to build your own\u00a0<a href=\"https:\/\/github.com\/kubeflow\/code-intelligence\/blob\/master\/Label_Microservice\/go\/Dockerfile\" rel=\"noopener noreferrer\" target=\"_blank\">ModelSync controller<\/a>\n<\/li>\n<li>\n<a href=\"https:\/\/github.com\/kubeflow\/code-intelligence\/tree\/master\/Label_Microservice\/go\/config\/default\" rel=\"noopener noreferrer\" target=\"_blank\">Modify the kustomize package<\/a>\u00a0to use your image and deploy the controller\n<\/li>\n<li>Define one or more lambdas as needed for your use cases\n<ul>\n<li>You can use our\u00a0<a href=\"https:\/\/github.com\/kubeflow\/code-intelligence\/blob\/master\/Label_Microservice\/go\/cmd\/automl\/pkg\/server\/server.go\" rel=\"noopener noreferrer\" target=\"_blank\">Lambda server<\/a>\u00a0as an example\n<\/li>\n<li>We wrote ours in go but you can use any language and web framework you like (e.g. 
flask)\n<\/li>\n<\/ul>\n<\/li>\n<li>Define Tekton pipelines suitable for your use cases; our pipelines (linked below) might be a useful starting point\n<\/li>\n<li>Define ModelSync resources for your use case; you can refer to ours as an example\n<\/li>\n<\/ol>\n<p>If you\u2019d like to see us clean it up and include it in a future Kubeflow release, please chime in on issue\u00a0<a href=\"https:\/\/github.com\/kubeflow\/kubeflow\/issues\/5167\" rel=\"noopener noreferrer\" target=\"_blank\">kubeflow\/kubeflow#5167<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>What\u2019s Next<\/h3>\n<p>\u00a0<\/p>\n<h3>Lineage Tracking<\/h3>\n<p>\u00a0<br \/>Since we do not have an explicit DAG representing the sequence of steps in our CI\/CD pipeline, understanding the lineage of our models can be challenging. Fortunately, Kubeflow Metadata solves this by making it easy for each step to record information about what outputs it produced using what code and inputs. Kubeflow Metadata can easily recover and plot the lineage graph. 
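The idea can be sketched with a toy lineage store. This is not the Kubeflow Metadata API; it is a hypothetical illustration of how recording each step's inputs and outputs is enough to recover the chain of steps behind any artifact, even without an explicit DAG. The step and artifact names are invented for the example.

```python
# Toy lineage store: each execution records its inputs and outputs.
# Hypothetical sketch, not the Kubeflow Metadata API.
executions = [
    {'step': 'prepare-data', 'inputs': ['bigquery:github_issues'], 'outputs': ['gcs:issues.csv']},
    {'step': 'train', 'inputs': ['gcs:issues.csv'], 'outputs': ['automl:model-20200724']},
    {'step': 'deploy', 'inputs': ['automl:model-20200724'], 'outputs': ['k8s:label-bot-prod']},
]

def lineage(artifact, execs):
    # Walk backwards from an artifact through every execution that
    # produced it, collecting the upstream steps along the way.
    chain = []
    for ex in reversed(execs):
        if artifact in ex['outputs']:
            chain.append(ex['step'])
            for parent in ex['inputs']:
                chain.extend(lineage(parent, execs))
    return chain

print(lineage('k8s:label-bot-prod', executions))
# ['deploy', 'train', 'prepare-data']
```

Because each record is written independently by the step that produced it, the full graph emerges from the records alone, mirroring how our pipeline emerges from independent controllers.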
The figure below shows an example of the lineage graph from our\u00a0<a href=\"https:\/\/github.com\/kubeflow\/examples\/blob\/master\/xgboost_synthetic\/build-train-deploy.ipynb\" rel=\"noopener noreferrer\" target=\"_blank\">xgboost example<\/a>.<\/p>\n<div>\n<img src=\"https:\/\/blog.kubeflow.org\/images\/2020-08-01-data-science-meets-devops\/fig6.lineage.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<div class=\"caption\">\n<strong>Figure 6:<\/strong>\u00a0screenshot of the lineage tracking UI for our\u00a0<br \/>\n<a href=\"https:\/\/github.com\/kubeflow\/examples\/blob\/master\/xgboost_synthetic\/build-train-deploy.ipynb\" rel=\"noopener noreferrer\" target=\"_blank\">xgboost example<\/a>.\n<\/div>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Our plan is to have our controller automatically write lineage tracking information to the metadata server so we can easily understand the lineage of what\u2019s in production.<\/p>\n<p>\u00a0<\/p>\n<h3>Conclusion<\/h3>\n<p>\u00a0<br \/><img alt=\"alt_text\" class=\"aligncenter\" src=\"https:\/\/blog.kubeflow.org\/images\/2020-08-01-data-science-meets-devops\/meme.png\" width=\"100%\"><\/p>\n<p>Building ML products is a team effort. In order to move a model from a proof of concept to a shipped product, data scientists and devops engineers need to collaborate. To foster this collaboration, we believe it is important to allow data scientists and devops engineers to use their preferred tools. 
Concretely, we wanted to support the following tools for Data Scientists, Devops Engineers, and\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Site_Reliability_Engineering\" rel=\"noopener noreferrer\" target=\"_blank\">SRE<\/a>s:<\/p>\n<ul>\n<li>Jupyter notebooks for developing models.\n<\/li>\n<li>GitOps for continuous integration and deployment.\n<\/li>\n<li>Kubernetes and managed cloud services for underlying infrastructure.\n<\/li>\n<\/ul>\n<p>To maximize each team\u2019s autonomy and reduce dependencies on tools, our CI\/CD process follows a decentralized approach. Rather than explicitly define a DAG that connects the steps, our approach relies on a series of controllers that can be defined and administered independently. We think this maps naturally to enterprises where responsibilities might be split across teams; a data engineering team might be responsible for turning weblogs into features, a modeling team might be responsible for producing models from the features, and a deployments team might be responsible for rolling those models into production.<\/p>\n<p>\u00a0<\/p>\n<h3>Further Reading<\/h3>\n<p>\u00a0<br \/>If you\u2019d like to learn more about GitOps we suggest this\u00a0<a href=\"https:\/\/www.weave.works\/technologies\/gitops\/\" rel=\"noopener noreferrer\" target=\"_blank\">guide<\/a>\u00a0from Weaveworks.<\/p>\n<p>To learn how to build your own Kubernetes controllers the\u00a0<a href=\"https:\/\/book.kubebuilder.io\/\" rel=\"noopener noreferrer\" target=\"_blank\">kubebuilder book<\/a>\u00a0walks through an E2E example.<\/p>\n<p>\u00a0<br \/><b><a href=\"https:\/\/www.linkedin.com\/in\/jeremy-lewi-600aaa8\/\" target=\"_blank\" rel=\"noopener noreferrer\">Jeremy Lewi<\/a><\/b> is a Software Engineer at Google.<\/p>\n<p><b><a href=\"https:\/\/hamel.dev\/\" target=\"_blank\" rel=\"noopener noreferrer\">Hamel Husain<\/a><\/b> is a Staff Machine Learning Engineer @ GitHub.<\/p>\n<p><a href=\"https:\/\/blog.kubeflow.org\/mlops\/\" 
target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. Reposted with permission.<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/08\/data-science-meets-devops-mlops-jupyter-git-kubernetes.html<\/p>\n","protected":false},"author":0,"featured_media":558,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/557"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=557"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/557\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/558"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=557"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=557"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=557"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}