Why Dimensional Data Outlives Every Tool · Library · CleanDims

A reader arriving at the dimensional data problem for the first time reasonably asks two questions. The first is whether the existing tools in their stack should have caught it. The second is whether the category called Master Data Management is the answer.

The short version is no, in both cases, and the reasons are not the same. Data quality tools were built to detect data that is broken; categorical variance is not broken. Master data management was built to govern entities; categorical attributes on those entities are governed at a different layer. This article walks through the precise boundaries: what each adjacent category was designed to solve, where its scope ends, and why dimensional data sits in the gap between them.

The reader who finishes this article should be able to answer the “we already have DQ tooling” and “we are evaluating MDM” objections with a clearer sense of what those investments do and what they leave unaddressed. The longer foundational argument is in the primer and the target state; this article is the positioning piece that sits between the foundational documents and the practitioner who is mapping the new category against the categories they already know.

What a canonical dimension actually is.

For the rest of the article to make sense, the central term needs to be precise. Canonical means “the agreed, authoritative form.” A canonical dimension is a governed reference that defines the full set of valid canonical values for that dimension, records every known raw variant observed in the organisation's systems, and maps each variant to its canonical form.

A canonical dimension is not a cleaned column in a spreadsheet. A cleaned column is the output of a one-time cleanup; the next time data arrives, the column is dirty again. A canonical dimension is a registry: a structured, versioned, governed reference that lives outside any individual system and that every system can join against. Source data does not change. The registry sits beside it.

The registry's structure is straightforward. Each entry has a canonical value, a definition, a named owner, a version, introduction and retirement dates, and the aliases under which the same value has been observed. A typical row, for the vendor name dimension, might look like this:

raw_value                  canonical_value                  source        vendor_id
AWS                        Amazon Web Services, Inc.        ERP           V-0041
aws                        Amazon Web Services, Inc.        Expenses      V-0041
Amazon Web Services        Amazon Web Services, Inc.        PO System     V-0041
Amazon AWS                 Amazon Web Services, Inc.        Contracts     V-0041

At query time, the consumer joins their raw data against the registry on the raw value and pulls the canonical into the query. The total spend reconciles across the four source records, regardless of which label each system happens to use.

What data quality tooling solves, and what it does not.

Most practitioners encountering the dimensional data problem will have data quality tooling running somewhere in their stack. The natural assumption is that the DQ checks should have flagged the variance. The natural assumption is wrong, and understanding why is the first step in seeing the boundary clearly.

Data quality tools are good at a specific class of problem: structural and format issues, where the data is objectively malformed, missing, or out of range. The tools were built around rule-based validation, and rule-based validation works well for problems that can be expressed as rules.

A useful way to see the boundary is to put the two categories of problem side by side:

What DQ tools catch	What DQ tools miss
Null values in required fields	"AWS" and "Amazon Web Services, Inc." are the same supplier
Duplicate primary keys	"SWE" and "Software Engineer L5" are the same role
Type mismatches (string in date field)	"Mid-Market," "Tier 2," and "SMB" mean different things to different teams
Values outside expected ranges	"MSA" and "Master Services Agreement" are the same contract type
Referential integrity violations	"prod," "PROD," "prd," and "live" are the same environment
Row count anomalies	"Q4-2024-EMEA-Brand" and "q4 brand emea" are the same campaign

The pattern in the right column is consistent. Two values are structurally valid, each non-null, each of valid length, each of the correct type, each within expected ranges, each referentially intact. The DQ check passes for both. The semantic equivalence between them is invisible to the rule engine, because the rule engine has no concept of what does this string refer to in the world.

This is not a gap in DQ tooling. It is a different problem. Validity is not the same as consistency. A field can be entirely valid and entirely inconsistent at the same time, and the consistency problem requires a different kind of resolution: semantic resolution and canonical mapping, not rule-based validation.

What master data management was built for.

The category called Master Data Management comes up in nearly every conversation about dimensional data, usually from a practitioner who has heard the term and reasonably assumes it covers exactly this kind of problem, sometimes from a vendor whose platform happens to include an MDM module. The assumption deserves a careful answer, because the answer is no, but the reasoning is subtle.

MDM platforms govern the master records of core business entities: customers, products, suppliers, employees. Their job is to maintain one golden record per entity, synchronise that record across the systems that hold copies, manage the full lifecycle of the entity, and govern who is allowed to update it and how.

MDM answers questions like: who is the authoritative record for this customer? What is the current canonical profile of this supplier? Which product record is the master when three systems hold conflicting attributes? These are real questions, and MDM platforms are the right answer to them. An MDM implementation typically runs twelve to eighteen months, requires dedicated platform administration, and integrates deeply with the systems that hold source entity records. The return on the investment is real, but it is the return of replacing fragmented entity management across the organisation, not the return of resolving categorical naming variance.

Where the confusion arises is that the categorical surface and the entity surface look similar from a distance. Both involve labels on records. Both involve variance across systems. Both feel like “naming problems.” The difference becomes visible at the level of what each layer governs:

What MDM governs	What MDM does not govern
Which customer record is the authoritative golden record	What the customer's industry attribute should be labelled
Supplier deduplication and lifecycle	Variant labels for the same supplier across systems
Product mastering across commerce and ERP	Product category taxonomy and its variants
Employee record reconciliation	Job title taxonomy and its variants
Bidirectional sync of entity records	Categorical attributes on those entity records

MDM tells the organisation that supplier record V-0041 is the golden record for Amazon Web Services. It does not resolve the fact that procurement calls them “AWS” in invoices, the ERP calls them “Amazon Web Services,” and the expense tool calls them “aws.” Those are categorical labels on the entity, not the entity itself. They live at a different layer.

The two layers are complementary.

A practitioner reading the above might conclude that MDM and dimension management are competitors. They are not. They operate at different layers and address different problems.

Consider the concrete case of an organisation with a mature MDM implementation. The MDM platform holds the golden supplier record for Amazon Web Services, Inc., with internal identifier V-0041. The MDM platform synchronises that record with the ERP, the procurement system, and the CRM, ensuring that all three know about the same supplier under the same internal identifier.

The MDM platform does not, however, resolve the inbound categorical labels. When an invoice arrives in the procurement system carrying the string “AWS,” the MDM platform has no opinion about which canonical that string maps to. When a sales rep enters “Amazon AWS” in the CRM, the MDM platform accepts the entry against the supplier record but does not normalise the label. The categorical attribute on the entity, the literal string that downstream consumers will join against, is unmanaged.

Dimension management is the layer that resolves the inbound categorical labels. It does not replace MDM; it complements it. MDM solves the “which record is authoritative” problem; dimension management solves the “how do we resolve any incoming string back to that authoritative record” problem. Both are real. Neither replaces the other.

A useful way to think about the relationship: MDM owns the entity. Dimension management owns the categorical surface of that entity. The two layers connect at the canonical reference, where the canonical form of the categorical label points back to the master entity record. The architecture is two-tiered and it works.

Why the problem outlives every tool that touches it.

DQ tooling is running. MDM is in place. Analysts are cleaning files before every report. The categorical inconsistency is still there six months later. Three structural reasons explain why the problem persists, none of which involves any of the existing tools failing at their job.

The first is the backlog problem. The scope of the dimensional data problem in a typical organisation is larger than any team initially assumed. A single dimension might have several thousand distinct raw values in the warehouse, half of which are duplicates of one another, the other half of which are genuinely different categories that have accumulated unrecognised over years. The work to canonicalise even one dimension is meaningful; the work to canonicalise the dozens of dimensions that matter is overwhelming. The tools were not absent; the backlog was simply larger than the resources allocated to clear it.

The second is the context problem. Ambiguous cases require business knowledge that no algorithm can infer from the strings alone. Does “SL” mean Silver tier or SnapLogic the vendor? Does “Core” refer to a product area, a contract tier, or a customer segment? Does “ENG” mean Engineering the function or England the country? Each of these can be answered, but only by someone who knows the organisation. Automation can resolve the high-confidence cases; the low-confidence cases require human judgement, and human judgement requires context.

The third is the governance problem. Even if the backlog is cleared and the context is resolved, the problem regenerates. New vendors are onboarded. New campaigns adopt new naming conventions. New products restructure the taxonomy. An acquisition adds a parallel set of categories. Without an operating model that catches new variance as it appears, the cleanup that took six months to complete starts decaying immediately. Cleanup without governance is a workaround, not a solution.

The first two reasons explain why the problem is hard to solve. The third explains why it stays solved only if the solution includes an operating model, not just a one-time intervention. This is the argument the target state document develops in full.

A ten-minute test, if you want to know whether this is your problem.

A reader who has followed the argument this far may want to know how to check for the problem in their own data. Five quick tests, runnable in under ten minutes on any dimension the organisation cares about.

Run SELECT COUNT(DISTINCT vendor_name) FROM your_table on the raw data in the system that holds the most records for one dimension. If the distinct count is significantly larger than the number of actual real-world values you know exist, variance is present.

Pick one entity you know well, a single supplier or job title or segment, and filter the column for that value. Count the distinct spellings, abbreviations, and casing variants. The count for any frequently-used value is rarely below three.

Export the same dimension from a second system that records related data and do a join on the raw value. Count how many records fail to match across the two systems. That count is your cross-system variance gap, and it is the work an analyst is doing every time they pull a cross-functional report.

Take ten raw values from the dimension and ask three different analysts to produce a canonical form for each. Compare the results. If the canonical forms differ across analysts, there is no shared standard, which means there is no canonical dimension in the formal sense; there are three informal standards held in three different people's heads.

Ask whether a document exists that defines the canonical values for the dimension, whether it is currently maintained, and whether it is referenced when new records are entered. If the answer to any of these is no, the canonical dimension does not formally exist for the organisation, regardless of how confident any individual is about what the right values are.

If any of the five tests reveals an issue, the problem is present. It may be small and manageable today. It will be larger and more expensive next year. Categorical variance accumulates, and the cost of resolving it grows with the volume of data that depends on the dimension being consistent.