DQ tools catch broken data. MDM governs entity records. Neither was built for categorical naming variance. This is the story of the problem that sits in the gap between them.
In data, a dimension is a descriptive attribute used to categorise or segment records. It answers "what kind of thing is this?" rather than "how many?" or "how much?"
Vendor name is a dimension. Job title is a dimension. Campaign tag, support ticket category, customer segment, contract type: all dimensions. They are the labels your organisation uses to classify everything it tracks.
Most organisations have data quality tooling that checks these fields. Some have MDM platforms that govern the entities behind them. And yet the same categorical inconsistency keeps reappearing in reports, because neither tool was built for this specific problem. That is what this article is about.
Canonical means "the agreed, authoritative form." When a dimension has a canonical form, there is one official expression of each concept, and a mapping from every variant encountered in the real world back to that official expression.
A canonical dimension is not simply a cleaned column in a spreadsheet. It is a governed registry: a structured reference that defines the full set of valid canonical values, records every known raw variant, and maps each variant to the right canonical form. It lives outside any individual system, and it is what every system joins against when it needs that dimension to be reliable.
Finance, sales, and analytics each refer to the same vendor by a different name. Cross-system reporting cannot reconcile them. Every analyst manually re-cleans the same data for every report.
Each raw value, however it was entered, maps to one canonical form. Spend from every source system consolidates correctly. The report runs without manual pre-work.
A canonical dimension registry is a table. Each row is a raw value observed somewhere in the organisation's data. Each row carries the canonical form it maps to, the source system it came from, and any enrichment that was added at the time of canonicalisation.
To use it, you join your raw data against the registry on the raw value column, and pull the canonical value into your query. The source data never changes. The registry sits beside it.
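As a minimal sketch of that query-time join (table names, column names, and values are illustrative, not from any real schema), using SQLite so the whole thing is self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw transaction data, left untouched in the source system.
cur.execute("CREATE TABLE spend (vendor_name TEXT, amount REAL)")
cur.executemany("INSERT INTO spend VALUES (?, ?)", [
    ("AWS", 1200.0),
    ("Amazon Web Services, Inc.", 800.0),
    ("aws", 300.0),
])

# The canonical dimension registry: one row per observed raw variant,
# carrying its canonical form and the system it was observed in.
cur.execute("""CREATE TABLE vendor_registry (
    raw_value TEXT PRIMARY KEY,
    canonical_value TEXT,
    source_system TEXT
)""")
cur.executemany("INSERT INTO vendor_registry VALUES (?, ?, ?)", [
    ("AWS", "Amazon Web Services", "procurement"),
    ("Amazon Web Services, Inc.", "Amazon Web Services", "erp"),
    ("aws", "Amazon Web Services", "crm"),
])

# Join raw data against the registry on the raw value and pull the
# canonical value into the query. The source column is never modified.
cur.execute("""
    SELECT r.canonical_value, SUM(s.amount)
    FROM spend s
    JOIN vendor_registry r ON s.vendor_name = r.raw_value
    GROUP BY r.canonical_value
""")
consolidated = cur.fetchall()
print(consolidated)  # [('Amazon Web Services', 2300.0)]
```

Three raw variants collapse to one canonical vendor, and the spend consolidates without touching the source data.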
This is the point that surprises most teams. They have invested in data quality tooling. Checks are running. Alerts are firing. And yet the vendor spend total is still wrong, because "AWS" and "Amazon Web Services, Inc." are each valid strings that pass every check.
The problem is not that your DQ tooling is insufficient. The problem is that this is a categorically different class of issue. Semantic equivalence between two valid strings is invisible to a rule-based validation engine. It requires a different kind of resolution entirely.
Data quality tools are genuinely good at a specific set of problems. This is not a criticism of them. They solve real and important issues. But their scope does not extend to semantic equivalence across free-text dimensions.
Data quality tools solve problems where the data is objectively malformed, missing, or out of range. They cannot solve problems where two values are structurally valid but semantically the same concept expressed in different ways.
A DQ tool validates each value against a rule. It cannot resolve meaning across values — that requires understanding that two different strings refer to the same thing in the world.
Data quality tools are doing exactly what they were designed to do. Categorical variance sits outside their remit by design. It requires a different approach, one built specifically around semantic resolution and canonical mapping rather than rule-based validation.
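To make the distinction concrete, here is a sketch of a typical rule-based check (the rule itself is illustrative). Both variants pass, because each is a well-formed string; the rule has no way to know they name the same vendor:

```python
import re

# A typical rule-based DQ check: non-empty, allowed characters, bounded length.
VENDOR_NAME_RULE = re.compile(r"^[\w .,&'()-]{1,100}$")

def passes_dq_check(value: str) -> bool:
    """Validate a single value against the rule. No cross-value reasoning:
    each string is judged in isolation."""
    return bool(VENDOR_NAME_RULE.match(value))

variants = ["AWS", "Amazon Web Services, Inc."]
print([passes_dq_check(v) for v in variants])  # [True, True]
```

Both strings are "clean" by every rule-based measure, which is exactly why the variance survives the DQ pipeline.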
Master Data Management comes up in almost every conversation about this problem, usually from someone who has heard the term and reasonably assumes it covers exactly this kind of issue. Sometimes from a vendor whose platform happens to include an MDM module. And it is a fair assumption: MDM does deal with golden records, entity resolution, and master data governance. The question is whether it is the right tool for this specific problem.
The short answer is: MDM is a different problem, solved at a different level, at a different cost, on a different timeline. Understanding what MDM actually does, and what it does not do, prevents an expensive mismatch.
MDM platforms manage the master records of core business entities: customers, products, suppliers, employees. Their job is to maintain a single golden record for each entity, synchronise that record across systems, manage the full lifecycle of that entity, and govern who can update it and how.
This is a substantial undertaking. MDM implementations typically run for twelve to eighteen months, require dedicated platform administration, and involve deep integration into the systems that hold source entity records. The ROI is real, but it is the ROI of replacing fragmented entity management across the organisation, not of resolving categorical naming variance.
MDM answers: who is the authoritative record for this customer? What is the current canonical profile of this supplier? Which product record is the master when three systems hold conflicting attributes?
MDM tells you that supplier record #12453 is the golden record for Amazon Web Services. It does not resolve the fact that the procurement system calls them "AWS" and the ERP calls them "Amazon Web Services": two labels for the same golden record that still need mapping.
Organisations that have MDM still accumulate categorical variance. The MDM platform holds the golden supplier record. The dimensions that describe how that supplier is categorised (spend category, contract type, preferred status) are still entered as free text in the systems that record transactions against them. The tool governs the entity. It does not govern the labels.
Your MDM platform holds the golden supplier record for Amazon Web Services, Inc., vendor_id V-0041. Your canonical dimension registry maps every label variant ("AWS", "aws", "AMZN", "Amazon AWS") back to vendor_id V-0041. The MDM platform tells you everything about the entity. The canonical registry tells you how to find it from any raw value in any source system.
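A sketch of how the two layers meet (the vendor_id comes from the example above; field names and attributes are illustrative): the registry resolves any raw label to the MDM key, and the MDM golden record supplies the entity attributes.

```python
# Canonical registry: raw label variant -> MDM golden-record key.
REGISTRY = {
    "AWS": "V-0041",
    "aws": "V-0041",
    "AMZN": "V-0041",
    "Amazon AWS": "V-0041",
}

# MDM golden record, keyed by vendor_id (attributes are illustrative).
MDM = {
    "V-0041": {"legal_name": "Amazon Web Services, Inc.", "preferred": True},
}

def resolve(raw_label: str) -> dict:
    """Map any raw label to its golden record via the registry."""
    vendor_id = REGISTRY[raw_label]
    return MDM[vendor_id]

print(resolve("AMZN")["legal_name"])  # Amazon Web Services, Inc.
```

The registry and the MDM platform are complementary: one owns the labels, the other owns the entity.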
DQ tooling runs. MDM is in place. Analysts clean files before every report. And the categorical inconsistency is still there six months later. Not because the tools failed, but because none of them was built to own this problem.
Pick one dimension your team depends on: vendor name, job title, customer segment, or support category. Run these five checks on the raw data in whichever system holds the most records for that dimension.
Run SELECT COUNT(DISTINCT vendor_name) FROM your_table. If the number is significantly larger than the number of actual vendors you do business with, variance is present.
Take a supplier, job title, or segment you interact with regularly. Filter the column for that value. Count how many different spellings, abbreviations, or casing variants appear.
Export the same dimension from a second system that records the same data. Do a simple join on the raw value. Count how many records fail to match. That count is your cross-system variance gap.
Give ten raw values from the dimension to three analysts and ask them each to produce a canonical form. Compare the results. If the canonical forms differ across analysts, there is no shared standard, which means there is no canonical dimension.
Does a document exist that defines the canonical values for this dimension? Is it maintained? Is it referenced when new records are entered? If the answer to any of these is no, the canonical dimension does not formally exist, even if some individuals have an informal understanding of it.
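The first and third checks above can be sketched in a few lines (the data, counts, and system names are illustrative):

```python
# Check 1: distinct raw values vs. vendors you actually do business with.
raw_values = ["AWS", "aws", "Amazon Web Services, Inc.", "Acme Corp", "ACME"]
actual_vendor_count = 2  # illustrative: these strings name only AWS and Acme
distinct = len(set(raw_values))
print(f"{distinct} distinct values for {actual_vendor_count} vendors")

# Check 3: join the same dimension from a second system on the raw value
# and count the records that fail to match.
system_a = {"AWS", "Acme Corp"}
system_b = {"Amazon Web Services, Inc.", "ACME"}
unmatched = system_b - system_a
print(f"cross-system variance gap: {len(unmatched)} unmatched records")
```

Five distinct strings for two real vendors, and a join in which every record from the second system fails to match: both numbers are the variance made visible.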
It may be small and manageable today. It will be larger and more expensive next year. Categorical variance accumulates. The cost of resolving it grows with the volume of data that depends on that dimension being consistent.
CleanDims is an AI-native service built specifically to produce canonical dimension registries for organisations. The registry is built alongside the data that already exists. No upstream system is modified. No migration is required. The canonical output is a structured file that joins against the raw data at query time.
The process combines AI-driven pattern matching with human subject matter expertise. The high-confidence cases can be resolved automatically, but the ambiguous cases require context that no algorithm can infer without knowing the organisation. The result is a registry that is both fast to produce and correct in the ways that matter.
Clusters raw variants, resolves high-confidence matches, flags ambiguous cases for human review with a recommendation attached.
Human data specialists review ambiguous clusters, apply business context, and escalate edge cases to the client's subject matter expert.
Every mapping decision is staged for client review before finalisation. The canonical registry is only delivered after explicit approval.
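As a sketch of the triage logic only (the threshold and the similarity scoring are illustrative stand-ins; the actual pipeline is AI-driven): high-confidence matches resolve automatically, and everything else is staged for human review with the best candidate attached as a recommendation.

```python
from difflib import SequenceMatcher

CANONICAL = ["Amazon Web Services", "Microsoft Azure"]
AUTO_RESOLVE_THRESHOLD = 0.8  # illustrative confidence cutoff

def best_match(raw: str):
    """Score a raw value against every canonical form; return the best pair."""
    scored = [(SequenceMatcher(None, raw.lower(), c.lower()).ratio(), c)
              for c in CANONICAL]
    return max(scored)

def triage(raw: str):
    """Auto-resolve above the threshold; otherwise flag for human review,
    carrying the top candidate as a recommendation."""
    score, candidate = best_match(raw)
    if score >= AUTO_RESOLVE_THRESHOLD:
        return ("resolved", candidate)
    return ("needs_review", candidate)

print(triage("Amazon Web Services, Inc."))
print(triage("AWS"))
```

A near-exact variant resolves automatically, while an abbreviation like "AWS" scores too low for any string metric and lands in the human review queue, which is the ambiguity the article says no algorithm can settle without organisational context.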
It passed every data quality check. It was not flagged by any tool. It has been accumulating quietly in every system that accepted a free-text field. You just did not have a name for it yet.
Submit a sample of one dimension. Get a chaos score, a full variant cluster map, and a scoped estimate back within one week. No commitment to proceed.
Request a Chaos Assessment: $500 one-time fee, credited in full against a subsequent engagement.