
The five ways dimensions go wrong.


Page 2 of 7 | Module: Foundations | Reading time: ~7 min

If dimensions were always entered consistently, the dimensional data problem would not exist. The interesting question is not whether they go wrong but in what specific ways. There are five common forms of inconsistency, each requiring a different kind of resolution. Understanding which form a particular inconsistency takes is the first step toward fixing it.

Surface variance.

The same concept, recorded with different formatting. “AWS” and “aws” and “A.W.S.” and “Amazon Web Services” all refer to the same supplier. The strings differ; the meaning is the same.

Surface variance is the easiest form to recognise. A human looking at the data can usually tell that the values are equivalent. It is also the easiest form to resolve, because the resolution is mechanical: pick a canonical form and apply a normalisation rule (lowercase, trim whitespace, expand known abbreviations).
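A minimal sketch of such a normalisation rule, assuming a small hand-maintained abbreviation table. The names `KNOWN_ABBREVIATIONS`, `normalise_key`, and `canonicalise` are hypothetical, as is the choice of “Amazon Web Services” as the canonical form.

```python
import re

# Hypothetical lookup table: normalised key -> chosen canonical form.
KNOWN_ABBREVIATIONS = {
    "aws": "Amazon Web Services",
    "amazon web services": "Amazon Web Services",
}

def normalise_key(raw: str) -> str:
    """Trim, lowercase, drop periods, and collapse internal whitespace."""
    key = raw.strip().lower()
    key = key.replace(".", "")        # "a.w.s." -> "aws"
    key = re.sub(r"\s+", " ", key)    # "amazon web  services" -> "amazon web services"
    return key

def canonicalise(raw: str) -> str:
    """Return the canonical form, or the trimmed input if no rule matches."""
    return KNOWN_ABBREVIATIONS.get(normalise_key(raw), raw.strip())

variants = ["AWS", "aws", "A.W.S.", " Amazon Web  Services "]
print({canonicalise(v) for v in variants})  # {'Amazon Web Services'}
```

Values with no matching rule pass through unchanged apart from trimming, so the rule never invents a canonical form it was not given.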

If this were the only form of variance, the dimensional data problem would be a tooling problem with an obvious solution. It is not the only form.

Semantic variance.

The same concept, recorded with genuinely different labels. “Laptop” and “Notebook” are not formatting variants; they are synonyms that different teams independently adopted. “Freelancer” and “Independent Contractor” and “1099 Worker” are three correct labels for the same employment category. “Customer Acquisition Cost” and “CAC” and “Cost per Acquired Customer” all refer to the same metric.

Resolving semantic variance requires picking which label is canonical. The pick is not deterministic. There is no formula that says “Laptop wins over Notebook.” The decision requires human judgement, and usually organisational agreement, because different teams will have preferences that reflect their own conventions.
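Because the choice is a decision rather than a computation, one way to record it is a curated synonym map. The sketch below is illustrative only: the map contents, the winners it picks, and the `resolve` helper are assumptions, not an agreed standard. Labels nobody has ruled on are flagged for review instead of being guessed.

```python
# Hypothetical, human-curated map: each entry records a decision someone made.
SYNONYMS = {
    "Notebook": "Laptop",
    "Freelancer": "Independent Contractor",
    "1099 Worker": "Independent Contractor",
    "CAC": "Customer Acquisition Cost",
    "Cost per Acquired Customer": "Customer Acquisition Cost",
}
CANONICAL = set(SYNONYMS.values())

def resolve(label: str) -> tuple[str, bool]:
    """Return (label, resolved). Unknown labels are flagged, not guessed."""
    if label in CANONICAL:
        return label, True
    if label in SYNONYMS:
        return SYNONYMS[label], True
    return label, False

labels = ["Laptop", "Notebook", "1099 Worker", "Ultrabook"]
needs_review = [label for label in labels if not resolve(label)[1]]
print(needs_review)  # ['Ultrabook']
```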

Definitional variance.

The same label, used for different things in different parts of the organisation. The sales team's definition of “Mid-Market” is companies with between 100 and 999 employees. The finance team's definition is companies with between $1M and $10M in annual revenue. The marketing team's definition is whatever the customer self-reported on a form. All three teams use the same word; the underlying populations are incompatible.

This is the most dangerous form of variance because it produces no visible inconsistency. The data looks clean. Reports run without errors. The conclusions drawn from the reports are nevertheless built on incompatible definitions of the categories being compared. A “Mid-Market” total that blends all three definitions is meaningless, and the meaninglessness is invisible.

Definitional variance cannot be resolved with normalisation rules or with synonym maps. It can only be resolved by getting the teams that use the label to agree on what it means, and then enforcing the agreement everywhere the label appears.
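The incompatibility is easy to demonstrate once the three definitions are written down as code. The sketch below applies each one to the same invented companies. The company records, the inclusive boundaries, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Company:
    name: str
    employees: int
    annual_revenue_usd: int
    self_reported_segment: str

# Three definitions of the same label (boundaries assumed inclusive).
def sales_mid_market(c: Company) -> bool:
    return 100 <= c.employees <= 999

def finance_mid_market(c: Company) -> bool:
    return 1_000_000 <= c.annual_revenue_usd <= 10_000_000

def marketing_mid_market(c: Company) -> bool:
    return c.self_reported_segment == "Mid-Market"

# Invented sample data.
companies = [
    Company("Acme", 250, 50_000_000, "Mid-Market"),
    Company("Birch", 40, 4_000_000, "Small Business"),
    Company("Cobalt", 500, 6_000_000, "Mid-Market"),
    Company("Dune", 30, 500_000, "Mid-Market"),
]

def members(rule) -> set[str]:
    """Names of the companies that a given definition counts as Mid-Market."""
    return {c.name for c in companies if rule(c)}

by_team = {
    "sales": members(sales_mid_market),
    "finance": members(finance_mid_market),
    "marketing": members(marketing_mid_market),
}
agreed = set.intersection(*by_team.values())
print(by_team)
print(agreed)  # {'Cobalt'}
```

Each team's count looks plausible on its own; only the intersection shows that the three populations barely overlap.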

Granularity variance.

The same concept, recorded at different levels of detail. The CRM records industry as “Financial Services.” The data warehouse records it as “Retail Banking.” An external data provider uses “BFSI.” Each value is correct at its own level. The problem is that they cannot be aggregated together without first agreeing on which level is canonical.

Geographic granularity is the classic example: city, state, country, and region are four levels of the same dimension. A record captured at city level has more detail than a record captured at country level, but a query joining two systems at different levels must either drop down to the lowest common denominator, which loses precision, or attempt to roll up the more detailed values, which requires an explicit hierarchy that rarely exists.
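Rolling up becomes mechanical once the hierarchy is written down. The sketch below assumes a hypothetical `HIERARCHY` table that maps each value to its level and its parent; the entries and the `roll_up` helper are illustrative. Going in the other direction, from a coarse value to a finer one, is refused because the missing detail cannot be recovered.

```python
# Hypothetical explicit hierarchy: value -> (level, parent).
LEVELS = ["city", "state", "country", "region"]
HIERARCHY = {
    "Austin": ("city", "Texas"),
    "Texas": ("state", "United States"),
    "United States": ("country", "North America"),
    "North America": ("region", None),
}

def roll_up(value, target_level):
    """Walk up the hierarchy until a value at target_level is reached."""
    while value is not None:
        level, parent = HIERARCHY[value]  # KeyError if the value is unmapped
        if level == target_level:
            return value
        if LEVELS.index(level) > LEVELS.index(target_level):
            raise ValueError(f"{value!r} is coarser than {target_level!r}")
        value = parent
    raise ValueError(f"no ancestor at level {target_level!r}")

print(roll_up("Austin", "country"))  # United States
```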

Temporal variance.

The canonical form changes over time. A product is renamed. A department is restructured. A vendor rebrands. The new label is applied to new records. The old records keep the old label. A query filtering on the new label silently drops historical data tagged with the old one.

Temporal variance is invisible because each individual record is correct for the time it was created. The inconsistency emerges only when records from different time periods are analysed together, and the analysis assumes that the labels are stable when they are not.
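The silent drop can be reproduced in a few lines. The sketch below assumes a hypothetical rename history kept as (old label, new label) pairs. The department names, the sample records, and the `labels_for` helper are invented for illustration.

```python
from datetime import date

# Hypothetical rename history: (old_label, new_label).
RENAMES = [
    ("Customer Success", "Customer Experience"),
]

# Invented sample records; each label was correct when the record was created.
records = [
    {"id": 1, "department": "Customer Success", "created": date(2021, 6, 1)},
    {"id": 2, "department": "Customer Experience", "created": date(2023, 3, 1)},
]

def labels_for(current_label: str) -> set[str]:
    """Return the current label plus every earlier label it replaced."""
    labels = {current_label}
    changed = True
    while changed:  # repeat so that chains of renames are followed
        changed = False
        for old, new in RENAMES:
            if new in labels and old not in labels:
                labels.add(old)
                changed = True
    return labels

naive = [r["id"] for r in records if r["department"] == "Customer Experience"]
aware = [r["id"] for r in records
         if r["department"] in labels_for("Customer Experience")]
print(naive, aware)  # [2] [1, 2]
```

The naive filter raises no error and returns a plausible result, which is why the missing historical record goes unnoticed.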

How the forms compound.

A single dimension typically suffers from several forms of variance at once. The casing is inconsistent (surface). The synonyms have not been reconciled (semantic). The definition has drifted across teams (definitional). The levels of detail vary across systems (granularity). The labels have changed over the years (temporal).

Treating these as a single “data quality problem” produces partial solutions that look like progress and leave most of the underlying inconsistency intact. The eight-form taxonomy in Layer 1 of the problem catalogue extends the list with additional forms such as structural contamination, missing values, incorrect classification, and cross-domain collision. The five above are the core; the catalogue is the full reference.
