Data Governance (Should Be) the Next Big Thing

Part 1: Approaches to Data Governance

Note: This article is the first of a three-part series exploring enterprise data governance in the era of AI. It is co-written with my former Disney colleague and friend, Monica Fulvio, who managed data governance efforts for the Disney media services IT group that supports TV, cable, motion picture and streaming production. In this first article, we’ll look at some motivations for data governance; subsequent installments will cover common pitfalls and, finally, keys to successful data governance. Initially this was going to be a single article, but the topic proved so large that we broke it up to cover it adequately. This article was previously published to LinkedIn.

The data governance market is pretty big and growing fast, according to analysts. Fortune Business Insights estimates the market in 2024 to have been $4.4 billion and expects it to grow to $19 billion by 2032. While that healthy ~20% annual growth projection recognizes the importance of data quality to organizations, the numbers are a tiny sliver of the overall global IT market, which Gartner estimates will reach $5.6 trillion this year. The IT market projections are dominated by hardware spending to upgrade core processing to support genAI implementations.

But we also know from analyses and personal experience that AI projects–generative or otherwise–are being held back by data scarcity within enterprises. It’s not that data doesn’t exist–there’s more than ever–but usable data is rarer than enterprise leadership might expect. And data that is “usable for AI” is scarcer still, because it must be relevant, accurate, unambiguous, unique (not redundant) and complete (adequately covering the problem set) to realistically feed automated data processes–whether that means training a machine learning model or a neural net, or populating a vector database. So, we expect interest in data governance to only increase, possibly beyond current projections, as AI projects move forward.
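
As a rough illustration of what two of those criteria–uniqueness and completeness–might look like as programmatic checks, here is a minimal sketch of ours (the dataset, field names and business key are hypothetical, not from any particular governance tool):

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key_cols: list[str], required_cols: list[str]) -> dict:
    """Crude checks for two of the criteria above: uniqueness and completeness."""
    total = len(df)
    # Uniqueness: what fraction of rows duplicate another row on the business key?
    dup_rate = float(df.duplicated(subset=key_cols).sum()) / total if total else 0.0
    # Completeness: what fraction of rows have every required field populated?
    fill_rate = float(df[required_cols].notna().all(axis=1).sum()) / total if total else 0.0
    return {"rows": total, "duplicate_rate": dup_rate, "fully_populated_rate": fill_rate}

# Hypothetical usage: titles keyed by (title_id, territory), requiring a release date.
titles = pd.DataFrame({
    "title_id":     ["t1", "t1", "t2", "t3"],
    "territory":    ["US", "US", "US", "UK"],
    "release_date": ["2024-05-01", "2024-05-01", None, "2023-11-10"],
})
print(quality_report(titles, key_cols=["title_id", "territory"], required_cols=["release_date"]))
# {'rows': 4, 'duplicate_rate': 0.25, 'fully_populated_rate': 0.75}
```

Relevance, accuracy and unambiguity are far harder to measure mechanically–which is part of why data quality is a practice rather than a script.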

We also know from experience that data governance programs tend to sputter because organizations underestimate the level of effort required to produce and maintain “gold standard” data. In addition to the data quality characteristics noted above, a data governance program must ensure that high quality data is accessible (to the right people or systems), plentiful and timely. Meeting those goals is actually a pretty tall order. If you hope to create a governance system that ensures the right data is available for the foreseeable future without adding an army of data managers, engineers, librarians and stewards, the problem is an order of magnitude more complex. It’s the difference between data clean-up, which might be a time-boxed exercise that can be outsourced, and a long-term quality management practice that is a core function of the enterprise.

Motivations for Governing Data

To assess the likelihood of a successful data management strategy, we should start at the beginning: by understanding the motivation for starting a data governance program in the first place. It might be tempting to divide such an investigation into “before genAI” and the present, but there is no real need to do so. AI-based automation provides significant impetus for getting a handle on enterprise data, but it doesn’t really change the fundamentals of data governance. Companies have been wrestling with data quality, availability and efficiency issues for decades. Further, the good governance strategies that supported automation efforts of the past are the same ones supporting genAI adoption today.

First Principles: Data Discovery

One of the most common motivations we hear from former colleagues and current clients is more quotidian: a desire to understand their own data landscape. We often hear something to the effect of “I know we have to do something about our data, but I can’t make a decision about it until I understand the problem.” So let’s call that “Motivation 1: Data Discovery.”  

Decision-makers must first determine what data problem they are attempting to solve. Common issues include one or more of the following:

  • Scale – there’s just too much data for existing systems and processes to handle;
  • Scope – data assets span multiple areas of expertise that cannot be managed by a single team;
  • Quality – data assets of questionable quality are frequently rejected by internal or external consumers, lead to compliance issues, or cannot be trusted because of insufficient testing and/or auditing;
  • Accessibility – data assets are plentiful but under-utilized because would-be consumers don’t know about them;
  • Efficiency – data assets are produced by older, inefficient processes (e.g., manual data entry), are produced over and over again at great expense because of accessibility issues, or are produced without sufficient quality controls, causing more work downstream;
  • Risk – curtailing privacy and regulatory risk, and reporting on it more efficiently, is a common data governance justification.

Managing Growth

Reducing the costs of data management without reducing volume or quality is often a strong motivator. (This isn’t to be confused with constraining volume to save on storage costs; outside of very large media files, that is no longer a common concern.) Indeed, it is often the anticipation that current data management systems and practices will not be able to keep pace with new business needs that inspires a data governance effort–if not a complete re-working of the data management infrastructure. Sometimes the cost of producing and maintaining data is so high that companies choose to go without it, constraining their ability to be more efficient or intentional elsewhere.

The Indirect Costs of Poor Data Quality

The advent of genAI is highlighting two additional costs that might previously have been somewhat hidden: ambiguity and accountability. Training or fine-tuning an LLM is infamously expensive–even for companies working with a pre-trained LLM as a starting point. The risks and costs of using an LLM to mass-produce poor-quality data without supervision can be even higher. So, feeding an LLM incorrect, out-of-date, unclear or irrelevant data is an enormous waste of resources, as many companies are finding out the hard way.

It is quite common for data coming from two different sources within a company to use the same field but mean different things. For example, from our own experience, the idea of a “release date” for a movie or TV episode is vastly more complex than you might imagine (e.g., theatrical release versus when it drops on streaming versus per-territory planned release, even without corner cases of original versus rerelease). And, yes, different studios have different understandings of such a value. Therefore, when data is combined from multiple studios it can cause numerous problems if not normalized to a common interpretation, or cleanly disambiguated in ways that make those differences explicit.

These sorts of costs can seem indirect because it takes a while to trace bad business outcomes back to ambiguous data. Then, when the culprit is found, organizations face a different challenge: it’s not a system, pipeline or storage failure; it’s an organizational issue. Suddenly, business units long used to operating independently and defining their own data landscape need to cooperate. A data pipeline can be built to normalize values from different fields–the value in Field X from this source is the same as Field Y in that source, so combine them as the data is integrated. But it is far harder to build a system that reconciles differing interpretations of the same value–sometimes the sources themselves don’t know the full meaning of the data they produce. The sketch below illustrates the difference.
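
To make that distinction concrete, here is a minimal sketch of both halves–the mechanical field renaming a pipeline can do, and the semantic tagging that depends on knowledge the data itself doesn’t carry. The sources, field names and categories are hypothetical, ours for illustration only:

```python
from dataclasses import dataclass

# The mechanical half: Source A calls the field "street_date", Source B calls it
# "release_dt". A pipeline can simply rename both to one canonical field.
FIELD_MAP = {
    "source_a": {"street_date": "release_date"},
    "source_b": {"release_dt": "release_date"},
}

# The semantic half: the same canonical field can still mean different things.
# This knowledge usually isn't in the data at all; people have to supply and
# maintain it, which is the governance (not engineering) part of the problem.
RELEASE_SEMANTICS = {
    "source_a": "theatrical",  # Source A's date is the theatrical premiere
    "source_b": "streaming",   # Source B's date is the streaming drop
}

@dataclass
class ReleaseDate:
    value: str   # ISO date, e.g. "2024-05-01"
    kind: str    # "theatrical", "streaming", "per_territory_plan", ...
    source: str  # provenance, so disagreements stay traceable

def integrate(record: dict, source: str) -> dict:
    """Rename fields to the canonical schema and tag release-date semantics."""
    out = {}
    for field, value in record.items():
        canonical = FIELD_MAP.get(source, {}).get(field, field)
        if canonical == "release_date":
            out[canonical] = ReleaseDate(value, RELEASE_SEMANTICS[source], source)
        else:
            out[canonical] = value
    return out

# Two sources, the "same" field – but the difference in meaning is now explicit
# instead of being silently collapsed into one ambiguous column.
print(integrate({"title": "Example Movie", "street_date": "2024-05-01"}, "source_a"))
print(integrate({"title": "Example Movie", "release_dt": "2024-08-15"}, "source_b"))
```

The FIELD_MAP rename is routine engineering; the RELEASE_SEMANTICS table is exactly the kind of shared interpretation that only cooperation across business units can produce and keep current.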

These are classic data governance issues, and they highlight the less technical side of the problem space. A data governance system doesn’t just show you all the data you have; it is an opportunity to maintain a common understanding of data and its definitions. However, just fostering that discussion among groups can be a significant challenge, and it requires an organizational structure that supports such cooperation. Fortunately, the perceived value of genAI is providing a stronger justification than ever before for the cost and effort of identifying and remedying such issues.

Know Your “Why”

Whatever your motivation for data governance, it’s always an exercise in reducing risk. All companies create, store and analyze data at some scale–financial, commercial, customer, operational. Most companies do so through a set of applications that are probably not designed to share data. Your CMS, CRM, project management, marketing, wiki and finance applications are probably distinct. If you’re managing a supply chain, warehouse and delivery logistics add specialized systems on top of the basics. If you are doing product or primary research–drug discovery, for example–you are adding still more data sources. If you’re in a regulated industry, like finance, you need to know not just where your data is but how it was created, who’s accessed it, and whether you can prove that it’s accurate. And all businesses store personally identifiable information on someone–employees, customers, prospects, contractors.

The bottom line is that every business has data, and at some point every business needs a way to manage and protect the data it has. Eventually, businesses want to get more value from that data, and will likely determine that they need better controls at the point of origin. So, every business governs data in some way. The question becomes “how much time and money do I need to spend on data management?” Within that is usually another question: “do I need a data governance effort or software?”

From our perspective, it’s safe to assume that the answers are “more than you’re doing now” and “yes.” What that data management effort (and investment) looks like, though, hinges on the core data problem you’re trying to solve and on what your initial data discovery reveals about your current maturity level relative to that problem.

In the next installment of this series we’ll look at common pitfalls that organizations encounter when they begin to wrest control of their data. Then we’ll conclude with some recommendations on how to approach the issue, with an emphasis on understanding your needs.