{"id":738,"date":"2020-08-26T18:28:18","date_gmt":"2020-08-26T18:28:18","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/26\/data-versioning-does-it-mean-what-you-think-it-means\/"},"modified":"2020-08-26T18:28:18","modified_gmt":"2020-08-26T18:28:18","slug":"data-versioning-does-it-mean-what-you-think-it-means","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/26\/data-versioning-does-it-mean-what-you-think-it-means\/","title":{"rendered":"Data Versioning: Does it mean what you think it means?"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/einat-orr-359ba6\/?originalSubdomain=il\" target=\"_blank\" rel=\"noopener noreferrer\">Einat Orr, PhD.<\/a>, Co-founder &amp; CEO at Treeverse<\/b><\/p>\n<p>When we first thought about a tagline for lakeFS, our recently released OSS project, we instinctively used terms such as \u201cData versioning\u201d, \u201cManage data the way you manage code\u201d, \u201cIt\u2019s git for data\u201d, and any random variation of the three that is a grammatically correct sentence in English. We were very pleased with ourselves for 5 minutes, maybe 7, before realizing these phrases don\u2019t really mean anything, or mean too many things to properly describe what value we bring. They are also commonly used by other players in the domain who address completely different use cases.<\/p>\n<p>We decided to map the world of projects declaring \u201c<em>Data Versioning<\/em>\u201d, \u201c<em>Manage data the way you manage code<\/em>\u201d, and \u201c<em>It\u2019s Git for Data<\/em>\u201d according to use cases.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Use Case #1: Collaboration over data\u00a0<\/strong><\/h3>\n<p>\u00a0<br \/><strong>The pain:<\/strong>\u00a0Data analysts and data scientists use many data sets, external and internal, that change over time. 
Managing access to data sets, and the different versions of each data set over time, is hard and error-prone.<\/p>\n<p><strong>The solution:<\/strong>\u00a0An interface that allows collaboration over the data and version management. The actual repository may be a proprietary database (e.g.\u00a0<a href=\"https:\/\/www.dolthub.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">DoltHub<\/a>), or the interface may provide efficient access to data distributed within your systems (e.g.\u00a0<a href=\"https:\/\/quiltdata.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">Quilt\u00a0<\/a>or\u00a0<a href=\"https:\/\/www.splitgraph.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">Splitgraph<\/a>). These interfaces grant easy access to, and management of, different versions of the same data set. Most players in this category also provide collaboration over other aspects of the workflow. The most popular is the ability to collaborate over ML models. In this category you can find the likes of\u00a0<a href=\"https:\/\/dagshub.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">DAGsHub<\/a>, DoltHub,\u00a0<a href=\"https:\/\/data.world\/\" rel=\"noopener noreferrer\" target=\"_blank\">data.world<\/a>,\u00a0<a href=\"https:\/\/www.kaggle.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">Kaggle<\/a>, Splitgraph, Quilt,\u00a0<a href=\"https:\/\/www.floydhub.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">FloydHub<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.datalad.org\/\" rel=\"noopener noreferrer\" target=\"_blank\">DataLad<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Use Case #2: Managing ML pipelines<\/strong><\/h3>\n<p>\u00a0<br \/><strong>The pain:<\/strong>\u00a0Running ML pipelines, from input data to tagged data, validation, modeling, optimizing hyperparameters, and introducing the models to production. There\u2019s no simple way to manage this pipeline and the many tools used in the process.<\/p>\n<p><strong>\u00a0The solution<\/strong>: MLOps tools. 
At this point you might be asking yourself, why would Ops tools be mentioned in the context of \u201cData Versioning\u201d? Well, it\u2019s because managing data pipelines is a major challenge in the ML application life cycle. Since ML is scientific work, it requires reproducibility, and reproducibility means data + code. There are a few MLOps tools that enable data versioning, and they include:\u00a0<a href=\"https:\/\/dvc.org\/\" rel=\"noopener noreferrer\" target=\"_blank\">DVC<\/a>,\u00a0<a href=\"https:\/\/www.pachyderm.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">Pachyderm\u00a0<\/a>and\u00a0<a href=\"https:\/\/mlflow.org\/\" rel=\"noopener noreferrer\" target=\"_blank\">MLflow<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Use Case #3: The need for Insert and Delete in immutable data lakes<\/strong><\/h3>\n<p>\u00a0<br \/><strong>The pain:<\/strong>\u00a0Data lakes over object storage are immutable (both objects and formats), but mutability is essential to:<\/p>\n<ol>\n<li>Comply with GDPR and other privacy regulations (delete records on demand)\n<\/li>\n<li>Ingest streaming data (requires appends)\n<\/li>\n<li>Handle backfills or late data (requires updates to already saved data).\n<\/li>\n<\/ol>\n<p><strong>The solution:<\/strong>\u00a0Structured Data Formats that allow Insert, Delete, and Upsert. The formats are columnar, and provide the ability to change an existing object by saving the delta of the changes into another object. The metadata of those objects includes the instructions on how to generate the latest version of an object from its saved delta objects. We add data versioning mainly to provide concurrency control. 
In this category you can find open source projects\u00a0<a href=\"https:\/\/iceberg.apache.org\/\" rel=\"noopener noreferrer\" target=\"_blank\">Apache Iceberg<\/a>,\u00a0<a href=\"https:\/\/hudi.apache.org\/\" rel=\"noopener noreferrer\" target=\"_blank\">Apache Hudi<\/a>\u00a0and\u00a0<a href=\"https:\/\/delta.io\/\" rel=\"noopener noreferrer\" target=\"_blank\">Delta Lake<\/a>\u00a0(by\u00a0<a href=\"https:\/\/databricks.com\/\" rel=\"noopener noreferrer\" target=\"_blank\">Databricks<\/a>).<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Use Case #4: Data lake manageability and resilience<\/strong><\/h3>\n<p>\u00a0<br \/><strong>The pain:\u00a0<\/strong>Managing multiple data producers and consumers of an object-storage-based data lake. The consumers access the data using different tools, such as Hadoop\/Spark, Presto, and analytics databases. Coordination between the data contributors and data consumers is challenging. It relies on internal processes and manual updates of catalogs or files. In addition, there\u2019s no easy way to provide isolation without copying data, and there is no way to ensure consistency between multiple data collections.<\/p>\n<p><strong>The solution:<\/strong>\u00a0An interface that allows collaboration over the data and version management. For example, the interface can provide Git-like terminology that allows versioning of the lake by\u00a0<a href=\"https:\/\/docs.lakefs.io\/branching\/model.html\" rel=\"noopener noreferrer\" target=\"_blank\">branching<\/a>, committing and merging changes.<\/p>\n<p>We created lakeFS after meeting over 30 companies managing a data lake over an object storage. These pains, which we knew well\u00a0<a href=\"https:\/\/lakefs.io\/2020\/08\/03\/introducing-lakefs\/\" rel=\"noopener noreferrer\" target=\"_blank\">from our own experience<\/a>, kept coming up. lakeFS is designed to deliver resilience and manageability to object storage data lakes. 
It is format agnostic and supports all formats that support mutability.<\/p>\n<div>\n<img src=\"https:\/\/lakefs.io\/wp-content\/uploads\/2020\/08\/Data-Versioning-Tools.-1.png\" alt=\"Figure\" width=\"100%\"><br \/><span>Example of Data Versioning tools<\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>\u00a0<br \/><b>Bio: <a href=\"https:\/\/www.linkedin.com\/in\/einat-orr-359ba6\/?originalSubdomain=il\" target=\"_blank\" rel=\"noopener noreferrer\">Einat Orr, PhD.<\/a><\/b> is Co-founder and Chief Executive Officer at Treeverse.<\/p>\n<p><a href=\"https:\/\/lakefs.io\/2020\/08\/10\/data-versioning\/\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. Reposted with permission.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/08\/data-versioning-mean-think-means.html<\/p>\n","protected":false},"author":0,"featured_media":739,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/738"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=738"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/738\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/739"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=738"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=738"},{"taxonomy":"post_ta
g","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=738"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}