{"id":8040,"date":"2020-12-29T00:26:46","date_gmt":"2020-12-29T00:26:46","guid":{"rendered":"https:\/\/healinglifespan.com\/data-science\/2020\/12\/29\/data-catalogs-are-dead-long-live-data-discovery\/"},"modified":"2020-12-29T00:26:46","modified_gmt":"2020-12-29T00:26:46","slug":"data-catalogs-are-dead-long-live-data-discovery","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/12\/29\/data-catalogs-are-dead-long-live-data-discovery\/","title":{"rendered":"Data Catalogs Are Dead; Long Live Data Discovery"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <!--author-start-->Debashis Saha &amp; Barr Moses<!--author-end--><\/b><\/p>\n<div><img src=\"https:\/\/miro.medium.com\/max\/1200\/0*1zO0b12HtJxyrXIh\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span><\/div>\n<p>\u00a0<\/p>\n<p><em>As companies increasingly leverage data to power digital products, drive decision making, and fuel innovation, understanding the health and reliability of these most critical assets is fundamental. For decades, organizations have relied on data catalogs to power data governance. But is that enough?<\/em><\/p>\n<p><a href=\"https:\/\/www.linkedin.com\/in\/debashis1saha\" rel=\"noopener noreferrer\" target=\"_blank\"><em>Debashis Saha<\/em><\/a><em>, VP of Engineering at AppZen, formerly at eBay and Intuit, and\u00a0<\/em><a href=\"https:\/\/www.linkedin.com\/in\/barrmoses\" rel=\"noopener noreferrer\" target=\"_blank\"><em>Barr Moses<\/em><\/a><em>, CEO and Co-founder of Monte Carlo, discuss why data catalogs aren\u2019t meeting the needs of the modern data stack, and how a new approach \u2014 data discovery \u2014 is needed to better facilitate metadata management and data reliability.<\/em><\/p>\n<p>It\u2019s\u00a0no secret: knowing where your data lives and who has access to it is fundamental to understanding its impact on your business. 
In fact, when it comes to building\u00a0<a href=\"https:\/\/www.montecarlodata.com\/how-to-build-your-data-platform-like-a-product\/\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>a successful data platform<\/strong><\/a>, it\u2019s critical that your data is both organized and centralized, while also easily discoverable.<\/p>\n<p>Analogous to a physical library catalog,\u00a0<a href=\"https:\/\/cloud.google.com\/data-catalog\" rel=\"noopener noreferrer\" target=\"_blank\">data catalogs<\/a>\u00a0serve as an inventory of metadata and give users the information necessary to evaluate data accessibility, health, and location. In our age of\u00a0<a href=\"https:\/\/searchbusinessanalytics.techtarget.com\/definition\/self-service-business-intelligence-BI\" rel=\"noopener noreferrer\" target=\"_blank\">self-service business intelligence<\/a>, data catalogs have also emerged as a powerful tool for data management and data governance.<\/p>\n<p>Not surprisingly, for most data leaders, one of their first imperatives is to build a data catalog.<\/p>\n<p>At the bare minimum, a data catalog should answer:<\/p>\n<ul>\n<li>Where should I look for my data?\n<\/li>\n<li>Does this data matter?\n<\/li>\n<li>What does this data represent?\n<\/li>\n<li>Is this data relevant and important?\n<\/li>\n<li>How can I use this data?\n<\/li>\n<\/ul>\n<p>Still, as data operations mature and data pipelines become increasingly complex, traditional data catalogs often fall short of meeting these requirements.<\/p>\n<blockquote>\n<p>\n<strong>Here\u2019s why some of the best data engineering teams are innovating their approach to metadata management \u2014 and what they\u2019re doing instead:<\/strong>\n<\/p>\n<\/blockquote>\n<p>\u00a0<\/p>\n<h3>Where data catalogs fall short<\/h3>\n<p>\u00a0<br \/>While data catalogs have the ability to document data, the fundamental challenge of allowing users to \u201cdiscover\u201d and glean meaningful, real-time insights about the health of your data 
has largely remained unsolved.<\/p>\n<p>Data catalogs as we know them are unable to keep pace with this new reality for three primary reasons: (1) lack of automation, (2) inability to scale with the growth and diversity of your data stack, and (3)\u00a0their undistributed format.<\/p>\n<p>\u00a0<\/p>\n<h3>Increased need for automation<\/h3>\n<p>\u00a0<br \/>Traditional data catalogs and governance methodologies typically rely on data teams to do the heavy lifting of manual data entry, holding them responsible for updating the catalog as data assets evolve. This approach is not only time-intensive, but requires significant manual toil that could otherwise be automated, freeing time up for data engineers and analysts to focus on projects that actually move the needle.<\/p>\n<p>As a data professional, understanding the state of your data is a constant battle and speaks to the need for greater, more customized automation. Perhaps this scenario rings a bell:<\/p>\n<p>Before stakeholder meetings, do you often find yourself frantically pinging Slack channels to figure out what data sets feed a specific report or model you are using \u2014 and why on earth the data stopped arriving last week? To cope with this, do you and your team huddle together in a room and start whiteboarding all of the various connections upstream and downstream for a specific key report?<\/p>\n<p>I\u2019ll spare you the gory details, but it probably looked something like this:<\/p>\n<div><img src=\"https:\/\/miro.medium.com\/max\/1200\/0*bAQL8xXipQ7Rx_BS\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<div class=\"caption\"><em>Does your data lineage look like a storm of lines and arrows? That makes two (hundred) of us. 
Image courtesy of\u00a0<\/em><a href=\"https:\/\/www.shutterstock.com\/g\/EgudinKa\" rel=\"noopener noreferrer\" target=\"_blank\">EgudinKa<\/a>\u00a0on\u00a0<a href=\"http:\/\/www.shutterstock.com\/\" rel=\"noopener noreferrer\" target=\"_blank\"><em>Shutterstock<\/em><\/a><em>.<\/em><\/div>\n<p><\/span><\/div>\n<p>\u00a0<\/p>\n<p>If this hits home, you\u2019re not alone. Many companies that need to solve this dependency jigsaw puzzle embark on a multi-year process to manually map out all their data assets. Some are able to dedicate resources to build short-term hacks or even in-house tools that allow them to search and explore their data. Even if it gets you to the end goal, this poses a heavy burden on the data organization, costing your data engineering team time and money that could have been spent on other things, like product development or actually using the data.<\/p>\n<p>\u00a0<\/p>\n<h3>Ability to scale as data changes<\/h3>\n<p>\u00a0<br \/>Data catalogs work well when data is structured, but in 2020, that\u2019s not always the case. As machine-generated data increases and companies invest in ML initiatives, unstructured data is becoming more and more common, accounting for\u00a0<a href=\"https:\/\/www.cio.com\/article\/3406806\/ai-unleashes-the-power-of-unstructured-data.html\" rel=\"noopener noreferrer\" target=\"_blank\">over 90 percent of all new data produced<\/a>.<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/how-to-build-your-data-platform-choosing-a-cloud-data-warehouse-3de66862f41c\" rel=\"noopener noreferrer\" target=\"_blank\">Typically stored in data lakes<\/a>, unstructured data does not have a predefined model and must go through multiple transformations to be usable and useful. Unstructured data is very dynamic, with its shape, source, and meaning changing all the time as it goes through various phases of processing, including transformation, modeling, and aggregation. 
What we do with this unstructured data (i.e., transform, model, aggregate, and visualize it), makes it much more difficult to catalog in its \u201cdesired state.\u201d<\/p>\n<p>On top of this, rather than simply\u00a0<em>describing<\/em>\u00a0the data that consumers access and use, there\u2019s a growing need to also\u00a0<em>understand<\/em>\u00a0the data based on its intention and purpose. How a producer of data might describe an asset would be very different from how a consumer of this data understands its function, and even between one consumer of data to another there might be a vast difference in terms of understanding the meaning ascribed to the data.<\/p>\n<p>For instance, a data set pulled from Salesforce has a completely different meaning to a data engineer than it would to someone on the sales team. While the engineer would understand what \u201cDW_7_V3\u201d means, the sales team would be scratching their heads, trying to determine if said data set correlated to their \u201cRevenue Forecasts 2021\u201d dashboard in Salesforce. And the list goes on.<\/p>\n<p>Static data descriptions are limited by nature. In 2021, we must accept and adapt to these new and evolving dynamics to truly understand the data.<\/p>\n<p>\u00a0<\/p>\n<h3>Data is distributed; catalogs are not<\/h3>\n<p>\u00a0<br \/>Despite the distribution of the modern data architecture (see:\u00a0<a href=\"https:\/\/towardsdatascience.com\/what-is-a-data-mesh-and-how-not-to-mesh-it-up-210710bb41e0\" rel=\"noopener noreferrer\" target=\"_blank\">the data mesh<\/a>) and the move towards embracing semi-structured and unstructured data as the norm, most data catalogs still treat data like a one-dimensional entity. 
As data is aggregated and transformed, it flows through different elements of the data stack, making it nearly impossible to document.<\/p>\n<div><img src=\"https:\/\/miro.medium.com\/max\/467\/1*WqFcAipu4s-Jg08-LwoR8A.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Traditional data catalogs manage metadata (data about your data) at the ingest state, but data is constantly changing, making it hard to understand the health of your data as it evolves in the pipeline. Image courtesy of Barr Moses.<\/em><\/p>\n<p><\/span><\/div>\n<p>\u00a0<\/p>\n<p>Nowadays, data tends to be\u00a0<a href=\"https:\/\/www.gartner.com\/en\/information-technology\/glossary\/self-describing-messages#:~:text=A%20message%20that%20contains%20data,consists%20of%20tag%2Fvalue%20pairs.\" rel=\"noopener noreferrer\" target=\"_blank\">self-describing<\/a>, containing both the data and the metadata that describes the format and meaning of that data in a single package.<\/p>\n<p>Since traditional data catalogs are not distributed, it\u2019s nearly impossible to use them as a central source of truth about your data. This problem will only grow as data becomes more accessible to a wider variety of users, from BI analysts to operations teams, and the pipelines powering ML, operations, and analytics become increasingly complex.<\/p>\n<p>A modern data catalog needs to federate the meaning of data across these domains. Data teams need to be able to understand how these data domains relate to each other and what aspects of the aggregate view are important. 
They need a centralized way to answer these distributed questions as a whole \u2014 in other words, a distributed, federated data catalog.<\/p>\n<blockquote>\n<p>\nInvesting in the right approach to building a data catalog from the outset will allow you to build a better data platform that helps your team democratize and easily explore data, allowing you to keep tabs on important data assets and harness their full potential.\n<\/p>\n<\/blockquote>\n<p>\u00a0<\/p>\n<h3>Data Catalog 2.0 = Data Discovery<\/h3>\n<p>\u00a0<br \/>Data catalogs work well when you have rigid models, but as data pipelines grow increasingly complex and unstructured data becomes the gold standard, our understanding of this data (what it does, who uses it, how it\u2019s used, etc.) does not reflect reality.<\/p>\n<p>We believe that next-generation catalogs will have the capabilities to learn, understand, and infer the data, enabling users to leverage its insights in a self-service manner. But how do we get there?<\/p>\n<div><img src=\"https:\/\/miro.medium.com\/max\/468\/1*u2xfQDsAGrvUyrxqPfDNuA.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><em>Data discovery can replace the modern data catalog by providing distributed, real-time insights about data across different domains, all while abiding by a central set of governance standards. Image courtesy of Barr Moses.<\/em><\/p>\n<p><\/span><\/div>\n<p>\u00a0<\/p>\n<p>In addition to cataloging data, metadata and data management strategies must also incorporate data discovery, a new approach to understanding the health of your distributed data assets in real time. 
Borrowing from the distributed domain-oriented architecture proposed by Zhamak Dehghani and Thoughtworks\u2019\u00a0<a href=\"https:\/\/martinfowler.com\/articles\/data-monolith-to-mesh.html\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>data mesh model<\/strong><\/a>, data discovery posits that different data owners are held accountable for their data as products, as well as for facilitating communication between distributed data across different locations. Once data has been served to and transformed by a given domain, the domain data owners can leverage the data for their operational or analytic needs.<\/p>\n<p>Data discovery replaces the need for a data catalog by providing a domain-specific, dynamic understanding of your data based on how it\u2019s being ingested, stored, aggregated, and used by a set of specific consumers. As with a data catalog, governance standards and tooling are federated across these domains (allowing for greater accessibility and interoperability), but unlike a data catalog, data discovery surfaces a real-time understanding of the data\u2019s current state as opposed to its ideal or \u201ccataloged\u201d state.<\/p>\n<p>Data discovery can answer these questions not just for the data\u2019s ideal state but for the current state of the data across each domain:<\/p>\n<ul>\n<li>What data set is most recent? Which data sets can be deprecated?\n<\/li>\n<li>When was the last time this table was updated?\n<\/li>\n<li>What is the meaning of a given field in my domain?\n<\/li>\n<li>Who has access to this data? When was the last time this data was used? 
By whom?\n<\/li>\n<li>What are the upstream and downstream dependencies of this data?\n<\/li>\n<li>Is this production-quality data?\n<\/li>\n<li>What data matters for my domain\u2019s business requirements?\n<\/li>\n<li>What are my assumptions about this data, and are they being met?\n<\/li>\n<\/ul>\n<p>We believe that the next-generation data catalog, in other words, data discovery, will have the following features:<\/p>\n<p>\u00a0<\/p>\n<h3>Self-service discovery and automation<\/h3>\n<p>\u00a0<br \/>Data teams should be able to easily leverage their data catalog without a dedicated support team. Self-service, automation, and workflow orchestration for your data tooling remove silos between stages of the data pipeline and, in the process, make it easier to understand and access data. Greater accessibility naturally leads to increased data adoption, reducing the load for your data engineering team.<\/p>\n<p>\u00a0<\/p>\n<h3>Scalability as data evolves<\/h3>\n<p>\u00a0<br \/>As companies ingest more and more data and unstructured data becomes the norm, the ability to scale to meet these demands will be critical for the success of your data initiatives. Data discovery leverages machine learning to gain a bird\u2019s eye view of your data assets as they scale, ensuring that your understanding adapts as your data evolves. This way, data consumers are set up to make more intelligent and informed decisions instead of relying on outdated documentation (aka data about data that becomes stale, how meta!) or worse \u2014 gut-based decision making.<\/p>\n<p>\u00a0<\/p>\n<h3><a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_lineage\" rel=\"noopener noreferrer\" target=\"_blank\">Data lineage<\/a>\u00a0for distributed discovery<\/h3>\n<p>\u00a0<br \/>Data discovery relies heavily on automated table and field-level lineage to map upstream and downstream dependencies between data assets. 
Lineage helps surface the right information at the right time (a core functionality of data discovery) and draw connections between data assets so you can better troubleshoot when data pipelines do break, which is becoming an increasingly common problem as the\u00a0<a href=\"https:\/\/www.montecarlodata.com\/data-observability-how-to-prevent-your-data-pipelines-from-breaking\/\" rel=\"noopener noreferrer\" target=\"_blank\">modern data stack evolves<\/a>\u00a0to accommodate more complex use cases.<\/p>\n<p>\u00a0<\/p>\n<h3>Data reliability to ensure the gold standard of data \u2014 at all times<\/h3>\n<p>\u00a0<br \/>The truth is \u2014 in one way or another \u2014 your team is probably already investing in data discovery, whether it\u2019s through manual work your team is doing to verify data, custom validation rules your engineers are writing, or simply the cost of decisions made based on broken data or silent errors that went unnoticed. Modern data teams have started leveraging automated approaches to ensuring highly trustworthy data at every stage of the pipeline, from data quality monitoring to more robust, end-to-end\u00a0<a href=\"https:\/\/towardsdatascience.com\/how-do-you-prevent-broken-data-pipelines-326f3c6d239e\" rel=\"noopener noreferrer\" target=\"_blank\">data observability platforms<\/a>\u00a0that monitor and alert for issues in your data pipelines. 
Such solutions notify you when data breaks so you can identify the root cause quickly for fast resolution and\u00a0<a href=\"https:\/\/www.montecarlodata.com\/the-rise-of-data-downtime\/\" rel=\"noopener noreferrer\" target=\"_blank\">prevent future downtime<\/a>.<\/p>\n<p>Data discovery empowers data teams to trust that their assumptions about data match reality, enabling dynamic discovery and a high degree of reliability across your data infrastructure, regardless of domain.<\/p>\n<p>\u00a0<\/p>\n<h3>What\u2019s next?<\/h3>\n<p>\u00a0<br \/>If bad data is worse than no data, a data catalog without data discovery is worse than not having a data catalog at all. To achieve truly discoverable data, it\u2019s important that your data is not just \u201ccataloged,\u201d but also accurate, clean, and fully observable from ingestion to consumption \u2014 in other words: reliable.<\/p>\n<p>A strong approach to data discovery relies on automated and scalable data management, which works with the newly distributed nature of data systems. Therefore, to truly enable data discovery in an organization, we need to rethink how we are approaching the data catalog.<\/p>\n<p>Only by understanding your data, the state of your data, and how it\u2019s being used \u2014 at all stages of its lifecycle, across domains \u2014 can we even begin to trust it.<\/p>\n<p><strong><em>Want to learn more about\u00a0<\/em><\/strong><a href=\"http:\/\/www.montecarlodata.com\/\" rel=\"noopener noreferrer\" target=\"_blank\"><strong><em>building better data catalogs<\/em><\/strong><\/a><strong><em>? 
Reach out to\u00a0<\/em><\/strong><a href=\"https:\/\/www.linkedin.com\/in\/debashis1saha\" rel=\"noopener noreferrer\" target=\"_blank\"><strong><em>Debashis Saha<\/em><\/strong><\/a><strong><em>\u00a0or\u00a0<\/em><\/strong><a href=\"https:\/\/www.linkedin.com\/in\/barrmoses\" rel=\"noopener noreferrer\" target=\"_blank\"><strong><em>Barr Moses<\/em><\/strong><\/a><strong><em>\u00a0and the\u00a0<\/em><\/strong><a href=\"http:\/\/www.montecarlodata.com\/\" rel=\"noopener noreferrer\" target=\"_blank\"><strong><em>Monte Carlo team<\/em><\/strong><\/a><strong><em>.<\/em><\/strong><\/p>\n<p>\u00a0<br \/><b><a href=\"https:\/\/www.linkedin.com\/in\/debashis1saha\/\" target=\"_blank\" rel=\"noopener noreferrer\">Debashis Saha<\/a><\/b> is the VP of Engineering at AppZen. Prior, he served as VP of Data Platforms at Intuit and eBay.<\/p>\n<p><b><a href=\"https:\/\/www.linkedin.com\/in\/barrmoses\/\" target=\"_blank\" rel=\"noopener noreferrer\">Barr Moses<\/a><\/b> is the CEO and Co-founder of Monte Carlo, a data observability company. Prior, she served as a VP of Operations at Gainsight.<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/data-catalogs-are-dead-long-live-data-discovery-a0dc8d02bd34?source=friends_link&amp;sk=3140d4e8c1e40c35ec7a29967f81a453\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. 
Reposted with permission.<\/p>\n<p><b>Related:<\/b><\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/12\/data-catalogs-dead-long-live-data-discovery.html<\/p>\n","protected":false},"author":0,"featured_media":8041,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8040"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8040"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8040\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8041"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8040"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8040"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8040"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}