The 5 Levels of Data Governance


Note: Once again we find ourselves with too much to write on this topic. Or maybe we just like to hear ourselves type. Either way, we have decided to split our final entry in this series into two documents. This article focuses on our proposed “governance levels” model, against which organizations can gauge their level of effort vis-à-vis their needs and goals. A more standard overview of governance best practices will appear as a post on the Tauru Systems LLC blog page. Both pieces are co-written with data governance professional Monica Fulvio.

Business Perception, IT Reality

Inevitably, there is a gap in perception, or maybe intention, between business and IT leaders. There is a lot of talk these days about “data democratization” and “data access” within enterprises. These terms express a desire to improve the processes that put data into the hands of people or tools that can use it. At this level, business and IT goals are aligned: automate data pipelines to remove human intermediaries between data collection and data consumption. The business goals are speed and lower costs–mostly the labor costs associated with creating, delivering and maintaining data. The IT goals are less effort and fewer moving parts.

However, these broad goals may conflict with the need to improve the corporate data landscape, let alone support the privacy and security complexity implied by broad, immediate and unattended data access. As we discussed previously, improving existing data can be a one-off exercise. “Improving” usually means both cleaning up existing data–eliminating duplicates, lowering ambiguity, patching holes in coverage, normalizing–and upgrading data governance to avoid costly cleanup in the future. These days, businesses want genAI to be the catalyst that delivers that speed and those cost savings, only to run once again into the data quality/availability inhibitor.

I have long noted that almost any new project dreamed up by the business–a new app, feature, service offering–assumes that the data to support it is already there. More often than not, that turns out not to be true. It’s actually uncanny, but it’s getting better.

This guide will help you map data governance efforts to business goals.  

Data Maturity and Governance Maturity

Business decision-makers understand that they need access to well-governed data in order to train LLMs and build systems like help desk assistants or product recommendation systems. That this has always been true for any AI/ML application is ironic only to long-suffering data engineers and scientists. The new focus on “good” or “clean” data has led to the adoption of several popular data maturity assessment models. (Here’s a handy ChatGPT summary of them. We don’t need to reprise that information here.)

When a business hits a roadblock caused by a lack of appropriate data, it either embarks on a program of generating that data or changes its goals. Sometimes the business goes back to rethink its entire data strategy. Sometimes it downsizes its ambition and agrees to make do with “embedded AI” offered by an existing platform. Sometimes it barrels ahead without regard for overall quality, reusability or even consistency. This latter approach is often adopted in “proof of concept” projects that seek to test the viability of a technology, only to realize that the tech is fine but the project won’t scale to production. We see this a lot in the knowledge graph business, to the eternal frustration of KG vendors everywhere.

Regardless of the chosen approach, somewhere along the way an executive asks, “How hard will it be to start generating the information we need?” The answer, almost always, is “harder than we first thought.” The costs start to add up: more databases, more SaaS, more IT staff, more hardware, more business analysts, more data engineers, scientists, stewards and librarians. The solution set is so complex that it’s not even a classic “build or buy” question; it’s never that binary.

But just as there are levels of data maturity, within that model structure there are levels of data governance maturity. And you don’t have to get ahead of your needs: data governance can be adopted incrementally. Understanding what level to pursue depends on your goals and a realistic assessment of the budget and runway to achieve them. Not every business needs to train its own LLM, or should. And there are plenty of reasons to tackle data governance that don’t have much to do with AI. If governance is a key pillar of overall data maturity, then we can expand on those models to enumerate levels of data governance sophistication within the same framework.

The Levels of Data Governance: Mapping Effort to Business Goals

We will refrain from calling the following list a “maturity” model, because in this case covering the basics of data governance may be all a company requires. Meeting simple needs with minimal cost is just as mature as pushing the boundaries of data pipelines.

Level 1: Basic Documentation and Data Discovery

This can be fairly simple: Understand what data you have where, the workflows involved in it, and your best opportunities and worst pain points around your data. Along with this, produce a document of the core data elements, describing the nature, meaning and location of that data. 

This doesn’t need to be hugely complex or require an expensive system; for organizations that have data but whose core business isn’t data, a spreadsheet or light database may be adequate. It’s also a great way to get started on this effort. For others, several commercial catalog vendors can help. At a minimum, make sure the record covers the authoritative system for every piece of data. If you want to do more with data later, this is an essential first step.
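To make the “light database” idea concrete, here is a minimal sketch of what a Level 1 inventory record might hold. All names and example rows are hypothetical; a spreadsheet with the same four columns would serve equally well.

```python
from dataclasses import dataclass

@dataclass
class DataElement:
    """One row in a minimal Level 1 data inventory."""
    name: str                   # business name of the element
    meaning: str                # plain-language definition
    location: str               # where the data physically lives
    authoritative_system: str   # the system of record for this element

# Hypothetical starting inventory: note the same element can live in two
# places while having a single authoritative system.
inventory = [
    DataElement("customer_id", "Unique identifier for a customer",
                "CRM: customers table", "CRM"),
    DataElement("customer_id", "Copy used for invoice lookups",
                "Finance: invoices table", "CRM"),
    DataElement("vendor_id", "Unique identifier for a supplier",
                "Finance: vendors table", "Finance"),
]

def authoritative_systems(element_name: str) -> set[str]:
    """Answer 'what is the system of record for this element?'"""
    return {e.authoritative_system for e in inventory
            if e.name == element_name}
```

Even this toy version makes the key Level 1 property visible: `customer_id` appears in two locations, but the inventory records a single system of record for it.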

Another key point to determine early on: how much data needs to be shared across systems? In many companies, data from the financial system does not need to be shared with the CMS. But maybe some pieces of that data need to be shared, such as customer or vendor IDs. If your business need for mixing data from multiple source systems is minimal, then Level 1 is just fine.

Level 2: Data Catalog and Core Glossary, Data Ownership

You’ll rapidly find that keeping that basic data documentation up to date in a spreadsheet becomes extraordinarily difficult as your data volume and variety grow. This is especially true if a goal is gaining a useful view across multiple systems or data sets. This use case is at the heart of the value offered by a data catalog–regardless of any additional bells and whistles. Data catalogs help manage the effort of maintaining data documentation and cataloging (though it still takes work), turning it into a (more or less) friendly system in which consumers can answer the “what do I have?” question.

If you’re serious about cataloging, identify organizational owners for core data properties and data sets and define essential terms for the business (which can help streamline and standardize the cataloging effort, by anchoring data points across systems in a shared concept set).
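One way to picture how a glossary anchors data points across systems in a shared concept set: a single business term maps to the differently named fields each system uses. The sketch below is purely illustrative; the term, owner and field names are all hypothetical.

```python
# A core glossary term anchors system-specific field names in one shared concept.
glossary = {
    "Customer": {
        "definition": "A person or organization that has placed at least one order",
        "owner": "Sales Operations",   # organizational owner of the term
        "fields": {                    # how each system names the same concept
            "CRM": "contact_id",
            "Finance": "cust_no",
            "CMS": "account_ref",
        },
    },
}

def field_for(term: str, system: str) -> str:
    """Resolve a shared business term to the field name a given system uses."""
    return glossary[term]["fields"][system]
```

The payoff is that cataloging work done in one system transfers: anyone asking about “Customer” data can find `cust_no` in Finance without knowing that system’s naming conventions.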

Note that this recommendation assumes an expanded investment–both for even the cheapest options in data catalog technology, and in time from the catalog owner and the data owners and stewards. Again–and we can’t say this enough–pay attention to the organizational structure (e.g., clear lines of ownership and responsibility, clear understanding of stakeholders, business goals) required to support your improving data maturity.

Level 3: Access Management, Privacy, Regulatory

A lot of large enterprises start here: They’ve had a data breach or an adverse regulatory event or a failed product launch, because the data wasn’t actually “there,” and they realized it too late. Access management and governing privacy and regulatory needs build on top of the foundations of good governance practice established in Levels 1 & 2. How often does a company embark on a privacy exercise only to find it doesn’t know what data it has or what it means, with a semi-implemented data catalog languishing in a corner? Of course, access and privacy controls are non-negotiable for regulated businesses. Groups working with lightly regulated data or little PII may simply have less (or less pressing) work to do here, but these steps are, in turn, a required foundation for democratization.

You need to understand the privacy, regulatory and access requirements of your data at the data element level, which means you need to understand what that data is and where it is–i.e., have comprehensive data cataloging and ownership in place already–and then overlay and describe that data with a taxonomy of your access, privacy and regulatory needs. If you are automating a data pipeline, a data element that can be changed by someone without visibility into the bigger picture is a common weak point in the pipeline.
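The “overlay” idea can be sketched simply: once elements are cataloged, each one gets classification tags from your taxonomy, and access checks consult those tags rather than ad hoc rules. The element names, tags and roles below are hypothetical, and a real system would enforce this in the data platform rather than in application code.

```python
# Privacy/access classifications overlaid on already-cataloged data elements.
classifications = {
    "customer_email": {"PII", "GDPR"},
    "order_total": {"Financial"},
    "product_sku": set(),            # unrestricted
}

# Which classifications each role is cleared to see (illustrative policy).
role_clearance = {
    "marketing_analyst": {"PII"},
    "finance_analyst": {"Financial"},
    "catalog_editor": set(),
}

def can_access(role: str, element: str) -> bool:
    """A role may read an element only if cleared for all of its tags."""
    return classifications[element] <= role_clearance[role]
```

Note that `marketing_analyst` is cleared for PII but still cannot read `customer_email`, because the element also carries a GDPR tag; element-level tagging is what makes that distinction possible.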

This level, therefore, is also where you want to track data lineage and have an understanding of downstream data usage, as well as a system of record for the core data points.

Level 4: Managed Pipeline & Democratization

By this point, your organization has well-cataloged data that can safely be given broad access (without overstepping access, privacy or regulatory concerns); this is where we get to that “self-service data mart” dream mentioned earlier.

There’s still a lot of process management at this stage–if you want to change the data, process oversight and adjustments will be required; and stitching together related but separate data points across the pipeline may take significant analysis. Organizations often use a graph database as data fabric to connect data at a categorical level (e.g., Customers are a category of People who place Orders for Products).    
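The categorical connections a graph-based data fabric holds (the Customers/People/Orders/Products example above) can be sketched as a tiny labeled edge list. This is a toy stand-in for a real graph database, with hypothetical categories, just to show the shape of the idea.

```python
# A toy "data fabric": category-level edges a graph database would hold.
edges = [
    ("Customer", "is_a", "Person"),
    ("Customer", "places", "Order"),
    ("Order", "contains", "Product"),
]

def related(category: str) -> set[str]:
    """Categories reachable in one hop from the given category."""
    return {dst for src, _, dst in edges if src == category}
```

In a production fabric the nodes would be backed by records in separate source systems; the graph's job is to let a consumer traverse from Customer to Product without knowing which system holds each piece.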

Level 5: The Future–Adaptive Pipeline Orchestration

A fully automated data pipeline that adapts (even semi-automatically) to new data inputs and new business needs is more of a dream than a reality for even large, sophisticated data-dependent businesses.   

If we peel back to what business leaders really want, their ambition goes beyond data delivery automation and into data collection automation. They want pipelines that not only generate data automatically, but can be re-tasked on demand to collect different or more information. This difference comes out clearly when business leaders talk about genAI, and we’ve seen businesses unwittingly conflate the two several times.

When asking “what is the effort required to get the information we need?” there is often an underlying hope that new data can simply be created with the existing pipeline. At the same time, some early genAI projects were built on the hope that genAI could paper over the cracks in their corporate data–whether by using general knowledge from the LLM to answer questions or by “interpreting” valuable data from unstructured corporate repositories. These desires combine two expectations: that the data is already “there” to support a genAI-fueled low-effort, high-reward future, and that new data can be added seamlessly to complete the picture.

However, even if you have achieved Level 4, building a data governance system to adjust to new data is well beyond the current ambitions of most corporate IT departments, even if you don’t expect adjustments to be fully automatic. Not that such fanciful stuff isn’t being done somewhere, but it’s well outside the normal scope of a governance effort.

Let’s use a supply chain example to illustrate. Acme Widget Company has a sensor on its assembly lines to count the number of parts of various types that pass through the line in a given amount of time. Great, from that data the company can build out just-in-time inventory management to ensure enough of those parts are in stock to meet manufacturing schedules. But the company also wants to know how many of a particular part come from each supplier, and then wants to trace those parts to the finished products and compare them to maintenance records. The company, of course, knows how many parts it buys from each supplier, but the assembly line sensors don’t distinguish one from the other. Meanwhile, the counts of parts per supplier are in paper or PDF records at the supply end. At the other end, parts can be distinguished by serial number, but collecting that information after the fact is going to be time-consuming and costly. Finally, it turns out that the assembly line sensors scan a barcode on each part as it passes through a gate, but the database that records the barcode numbers does not include the manufacturer.

The obvious answer here is to upgrade the scanning database to match barcodes with supplier serial numbers and manufacturers, but that’s not going to be enough. The company wants to know if parts from Manufacturer A are more likely to be involved in a repair request than the same parts from Manufacturer B. How do you get there? Digitize the supplier manifests, manually enter data as inventory comes into the warehouse, record serial numbers as part of the repair process and manually look up the match in the inventory database?  
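Once the scanning database is upgraded to link barcodes with serial numbers and manufacturers, the comparison the company wants reduces to a join plus a per-manufacturer rate. The sketch below uses entirely made-up records to show that final analysis step; everything upstream of it (digitizing manifests, capturing serials at repair time) is the hard part the text describes.

```python
from collections import Counter

# Hypothetical upgraded scanning records: barcode -> (serial, manufacturer).
scans = {
    "BC1": ("SN100", "Manufacturer A"),
    "BC2": ("SN101", "Manufacturer A"),
    "BC3": ("SN200", "Manufacturer B"),
    "BC4": ("SN201", "Manufacturer B"),
}

# Serial numbers captured during the repair process (hypothetical).
repairs = ["SN100", "SN101"]

def repair_rate_by_manufacturer() -> dict[str, float]:
    """Share of each manufacturer's scanned parts that later needed repair."""
    totals = Counter(mfr for _, mfr in scans.values())
    serial_to_mfr = {sn: mfr for sn, mfr in scans.values()}
    repaired = Counter(serial_to_mfr[sn] for sn in repairs
                       if sn in serial_to_mfr)
    return {mfr: repaired[mfr] / totals[mfr] for mfr in totals}
```

With these toy inputs, every Manufacturer A part was repaired and no Manufacturer B part was, which is exactly the signal the company is after. The governance question is whether the pipeline can keep that join valid as new suppliers and new data sources are added.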

This is a real use case for which the proposed solution was to knit together data from various systems with a knowledge graph. But if you’re not already collecting some key piece of data, that’s not a complete solution. You need a data pipeline that can add new data automatically and feed it reliably to an analysis tool that gives the business the information it needs. Even if retrieving the new information on demand were possible, with or without human intervention, will your governance system keep pace to seamlessly integrate the new data with existing databases?

Such a governance platform–let’s call it “adaptive governance”–supporting a flexible data pipeline that can be re-tasked to collect new information is at the highest end of a hierarchy of data governance maturity.

In Summary 

Captain Jack Sparrow once had a compass that would show him the way to “his heart’s desire,” but it didn’t work for him until he decided what he wanted. This guide hopefully helps you make a similar calculation by clarifying common data governance goals and what business values they unlock. Knowing the goal and how to get there are different things, of course. In short, you need a goal and a compass. So here are the bullet points:

  • Understand the goals of your business and stakeholders, including their aspirations for seamless, genAI-ready data; 
  • Know where the organization truly is with the state, extent and governance of its data; and 
  • Make a realistic appraisal of the governance level you need to work towards, including the balance between investment and ROI.

Use this guide to gauge your desired level of governance effort against business ambitions and needs. Naturally, understand your own skill and knowledge level in data governance and management, and know when to bring in help. In our linked Best Practices document, we offer an overview to get you started on an implementation at any level.
