
The dimensional data problem.

Four layers. 34 named failure modes. Citable, deep-linked, and directly connected to the interactive diagnostic.

LAYERS: 4. ENTRIES: 34. LICENSE: CC BY 4.0.

This is the reference catalogue for the dimensional data problem. It is not an essay. It enumerates the specific failure modes observed in practice across four causally-connected layers: the variance in the data, the processes that produce it, the organisational conditions that allow it to persist, and the downstream business consequences.

Each entry is independently anchorable. The URL pattern is /open-source/the-problem#l1-entry-slug. Practitioners cite entries by anchor in retrospectives, RFCs, and incident reviews. Each entry carries explicit causal edges to adjacent layers: the upstream causes that produce it and the downstream effects it produces in turn. The chain reads in either direction.

The interactive companion to this catalogue is Trace the Chain, a diagnostic that walks the same four layers as a guided conversation. The catalogue is the reference; the diagnostic is the tool.

LAYER 1: 8 entries

Data layer.

The variance in the data itself.

The forms dimensional variance takes once it exists. Each entry is a specific shape of inconsistency visible in the values stored in the data, independent of how the inconsistency got there. These are what a practitioner sees when they open a table and look at a column.

L1.1

Same concept, different string

The same entity represented with different casing, abbreviation, spacing, or encoding. Each value means the same thing, but they don't match on a join because the characters differ.
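
A minimal sketch of how casing, spacing, and encoding differences can be collapsed into a single comparison key before a join. The function name `normalise_label` and the sample values are illustrative assumptions, not part of the catalogue. Abbreviations are out of scope here: no string operation in this sketch would map an abbreviated form onto the full one.

```python
import unicodedata


def normalise_label(value: str) -> str:
    """Build a comparison key; the stored value itself is left untouched."""
    # NFKC folds compatibility forms such as full-width letters
    # and non-breaking spaces into their ordinary equivalents.
    text = unicodedata.normalize("NFKC", value)
    # casefold() is a stronger, locale-independent version of lower().
    text = text.casefold()
    # split() with no arguments trims the ends and collapses runs of whitespace.
    return " ".join(text.split())


# Hypothetical variants of one country label.
variants = ["United States", "united states", " UNITED  STATES ", "United\u00a0States"]
keys = {normalise_label(v) for v in variants}
```

All four variants reduce to one key, so a join on the key matches where a join on the raw strings would not.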

L1.2

Different labels for the same real-world entity

Two different strings, both valid, that refer to the same thing. Unlike formatting differences, resolving these requires domain knowledge, because no string operation can determine that Laptop and Notebook are the same product.
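
Because the equivalence is a matter of domain knowledge, it has to be recorded somewhere explicit. The sketch below assumes a hand-curated alias table; `ALIASES` and `canonical_product` are hypothetical names, and the choice of Laptop as the canonical form is arbitrary.

```python
from typing import Optional

# Hypothetical alias table, curated by someone who knows the product range.
ALIASES = {
    "laptop": "Laptop",
    "notebook": "Laptop",
}


def canonical_product(label: str) -> Optional[str]:
    """Return the canonical label, or None when the table has no entry."""
    return ALIASES.get(label.strip().casefold())
```

Returning `None` for unknown labels, rather than passing them through, keeps unresolved values visible so that a person can decide where they belong.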

L1.3

Multiple concepts crammed into one field

A single field encodes several pieces of information: country, segment, year, and quarter concatenated into one string. The data can't be filtered or aggregated by any individual component until it's decomposed.
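
One way to decompose such a field, sketched under the assumption that the components are joined by a hyphen in a fixed order, as in the hypothetical value `US-ENT-2023-Q4`. The layout, the `Dimensions` type, and the `decompose` function are all illustrative; real composite fields rarely follow one layout consistently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimensions:
    country: str
    segment: str
    year: int
    quarter: int


def decompose(code: str, sep: str = "-") -> Dimensions:
    """Split a composite code into typed components, failing loudly on bad input."""
    parts = code.split(sep)
    if len(parts) != 4:
        raise ValueError(f"expected 4 components, got {len(parts)}: {code!r}")
    country, segment, year, quarter = parts
    if not (quarter.startswith("Q") and quarter[1:].isdigit()):
        raise ValueError(f"unrecognised quarter {quarter!r} in {code!r}")
    # int(year) raises ValueError itself if the year is not numeric.
    return Dimensions(country, segment, int(year), int(quarter[1:]))
```

Once decomposed, each component can be filtered or aggregated on its own, for example all records where `quarter == 4` regardless of country.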

L1.5

Correct when entered, but the world changed since

A department was renamed. A product was discontinued. A vendor rebranded. The old labels persist in historical records. Any query using the current name silently drops all records tagged with the old one.
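
The silent drop can be illustrated with a small rename history. The department names, the `RENAMES` table, and the record layout below are invented for the example.

```python
# Hypothetical rename history: retired label -> the label that replaced it.
RENAMES = {"Customer Support": "Customer Success"}

records = [
    {"id": 1, "department": "Customer Support", "cost": 100},  # tagged before the rename
    {"id": 2, "department": "Customer Success", "cost": 250},
]


def current_label(label: str) -> str:
    """Follow the rename chain to the current label; `seen` guards against cycles."""
    seen = set()
    while label in RENAMES and label not in seen:
        seen.add(label)
        label = RENAMES[label]
    return label


# Filtering on the current name alone misses the historical record.
naive = sum(r["cost"] for r in records if r["department"] == "Customer Success")

# Resolving each label through the rename history first recovers it.
resolved = sum(
    r["cost"] for r in records if current_label(r["department"]) == "Customer Success"
)
```

The naive query returns a plausible number with no error, which is why the loss tends to go unnoticed.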

L1.7

Clean, valid value assigned to the wrong category

The value exists in the picklist. The field is populated. It passes every mechanical quality check. But someone tagged the record with the wrong category. Only a person with domain context will catch it.

L1.8

Same label means different things in different contexts

Three departments use the word Enterprise to mean three different things. Two systems use the same code (Department code 100) for different concepts. A merger leaves two products called Platform in one combined namespace. The label is identical, the meanings are not, and the collision is invisible until someone joins the data.
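
A sketch of the collision on join, assuming two hypothetical systems (`erp` and `crm`) that both use department code 100. The meanings assigned to the code are made up for illustration.

```python
# Hypothetical meanings of the same code in two systems.
MEANINGS = {
    ("erp", "100"): "Finance",
    ("crm", "100"): "Field Sales",
}

erp_rows = [{"dept_code": "100", "headcount": 12}]
crm_rows = [{"dept_code": "100", "accounts": 40}]

# Joining on the bare code pairs rows that describe different departments.
bare_matches = [
    (e, c) for e in erp_rows for c in crm_rows if e["dept_code"] == c["dept_code"]
]

# Qualifying each code by its source system before comparing exposes the mismatch.
meaning_matches = [
    (e, c)
    for e in erp_rows
    for c in crm_rows
    if MEANINGS[("erp", e["dept_code"])] == MEANINGS[("crm", c["dept_code"])]
]
```

The bare join succeeds and produces a row; only the system-qualified comparison shows that the two codes were never the same concept.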

LAYER 2: 9 entries

Process layer.

How the variance is produced and propagates.

The processes that produce the variance in Layer 1 or fail to prevent it. Each entry is a specific mechanism by which inconsistency enters the data, moves between systems, or accumulates over time. Each process failure produces one or more data-layer outcomes.

L2.1

No validation on data entry

The field is a text box with no dropdown, no picklist, and no enforcement against existing values. Every person who enters data types their own format, abbreviation, and naming preference. The single largest source of dimensional variance by volume.
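
For contrast, a minimal sketch of enforcement at the point of entry. The picklist contents and the `validate_segment` function are assumptions made for the example.

```python
# Hypothetical picklist of allowed segment values.
ALLOWED_SEGMENTS = frozenset({"Enterprise", "Mid-Market", "SMB"})


def validate_segment(value: str) -> str:
    """Accept only values already in the picklist; reject everything else."""
    if value not in ALLOWED_SEGMENTS:
        options = ", ".join(sorted(ALLOWED_SEGMENTS))
        raise ValueError(f"{value!r} is not an allowed segment; choose one of: {options}")
    return value
```

The check is deliberately exact: `smb` is rejected rather than silently corrected, so the person entering the data sees the canonical form.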

L2.2

The tool won't allow correct entry

A field limited to 30 characters forces abbreviation. No multi-select support forces people to cram values together. Platform defaults like Uncategorised fill the field automatically when the user skips it.

L2.3

Data degrades every time it moves between systems

When data flows through ETL pipelines, APIs, or migrations, encoding changes, fields get dropped, and schema mappings don't align on values. The source data may be clean. The problem happens in the space between systems.
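
The encoding case is easy to reproduce. In this sketch a clean source value is written as UTF-8 and read by a receiver that assumes Latin-1; the sample string is hypothetical.

```python
original = "Café Société"

# The source system writes UTF-8 bytes.
wire_bytes = original.encode("utf-8")

# The receiving system wrongly assumes Latin-1, so every "é" arrives as "Ã©".
received = wire_bytes.decode("latin-1")

# The damage is reversible only if the wrong assumption is known.
repaired = received.encode("latin-1").decode("utf-8")
```

Neither system contains a bug in isolation, and the source value was clean. The variance was introduced by the mismatch between them.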

L2.4

The schema itself invites bad data

No canonical reference table means any string is accepted as valid. No hierarchy constraints mean impossible combinations are permitted. No versioning means renames destroy history. The data model removes the guardrails.
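
A sketch of the first guardrail, a canonical reference table enforced by a foreign key, using Python's built-in `sqlite3` module. The table and column names are hypothetical, and hierarchy constraints and versioning are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")

# The reference table is the only place a vendor name can be introduced.
conn.execute("CREATE TABLE vendor (name TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE invoice ("
    " id INTEGER PRIMARY KEY,"
    " vendor_name TEXT NOT NULL REFERENCES vendor(name))"
)

conn.execute("INSERT INTO vendor (name) VALUES ('Meta')")
conn.execute("INSERT INTO invoice (vendor_name) VALUES ('Meta')")

try:
    # A free-text variant is rejected because it has no row in the reference table.
    conn.execute("INSERT INTO invoice (vendor_name) VALUES ('meta inc')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

invoice_count = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
```

Without the `REFERENCES` clause, or with enforcement left off, the second insert would succeed and the variant would be stored as if it were valid.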

L2.5

The world changes but the data doesn't follow

A vendor rebrands from Facebook to Meta. The ERP is updated, but the CRM, expense tool, and contract system still carry the old name. No process exists to cascade corrections across systems.
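
Even without a cascade process, the stale systems can at least be listed. The sketch below assumes each system can export its distinct vendor names; the system keys and their contents are invented to mirror the example above.

```python
# Hypothetical export of distinct vendor names per system.
systems = {
    "erp": {"Meta", "Google"},
    "crm": {"Facebook", "Google"},
    "expenses": {"Facebook"},
    "contracts": {"Facebook", "Meta"},
}


def systems_with_stale_name(vendor_names_by_system, old_name):
    """List the systems that still carry a retired name, in alphabetical order."""
    return sorted(
        system
        for system, names in vendor_names_by_system.items()
        if old_name in names
    )


stale = systems_with_stale_name(systems, "Facebook")
```

A system such as `contracts`, which carries both the old and the new name, is the most troublesome case because queries against it split the same vendor in two.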

L2.6

Manual processing steps introduce new variance

Data passes through a spreadsheet, a copy-paste step, or a manually maintained mapping table. Three people maintain the mapping. Each picks a different canonical form. The tool meant to resolve variance becomes a new source of it.
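
Disagreement between maintainers can be detected mechanically by comparing their mapping tables. The maintainer names, vendor strings, and `find_conflicts` function here are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical mapping tables kept by three people: raw value -> canonical form.
mappings = {
    "alice": {"Acme Ltd": "Acme", "ACME": "Acme"},
    "bob": {"Acme Ltd": "ACME Limited"},
    "carol": {"Acme Ltd": "Acme Ltd.", "ACME": "Acme"},
}


def find_conflicts(mappings_by_owner):
    """Return each raw value that is mapped to more than one canonical form."""
    targets = defaultdict(set)
    for mapping in mappings_by_owner.values():
        for raw, canonical in mapping.items():
            targets[raw].add(canonical)
    return {raw: sorted(forms) for raw, forms in targets.items() if len(forms) > 1}


conflicts = find_conflicts(mappings)
```

Here `Acme Ltd` resolves to three different canonical forms depending on whose table is used, while `ACME` is mapped consistently and is not reported.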

L2.7

Automation and AI produce errors at scale

A rule-based classifier misclassifies edge cases systematically. An ML model trained on outdated labels produces confident, wrong output. Unlike human errors, which are random, automation errors are systematic and affect entire populations.

L2.8

Nobody owns the dimension

No individual or team is accountable for the quality of any dimension field. When a problem is found, there's no one to report it to and no one whose job it is to fix it. The dimension exists in a governance vacuum.

LAYER 3: 8 entries

Organisational layer.

Why the process failures persist.

The structural conditions in the organisation that allow Layer 2 process failures to persist. Each entry is a specific organisational pattern that lets the dimensional data problem continue. These are the root causes from which the rest of the chain follows.

L3.7

The organisation grew faster than its data management

New products were added without updating taxonomies. New geographies onboarded with local conventions. Acquisitions brought parallel dimensional structures. The data team added one person. The system count quadrupled.

LAYER 4: 9 entries

Downstream layer.

The business consequences felt by consumers.

The business consequences that follow from the data-layer variance. Each entry is a specific category of harm felt by consumers of the dimensional data: analysts, models, decision-makers, regulators, and customers. A reader walking from symptom to cause starts here.

L4.1

Reports don't match across teams

Finance and Sales pull the same revenue metric from different systems and get different totals. The gap isn't analytical. It comes from different segment labels, vendor names, or product categories in each source.
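
A small sketch of how label variance alone produces the gap. The two row sets below are hypothetical and contain identical amounts; only one segment label differs.

```python
from collections import defaultdict

# Hypothetical (segment, revenue) rows for the same three deals in two systems.
finance_rows = [("Enterprise", 500), ("Enterprise", 300), ("SMB", 200)]
sales_rows = [("Enterprise", 500), ("Ent.", 300), ("SMB", 200)]


def revenue_by_segment(rows):
    """Sum revenue per segment label exactly as the label is stored."""
    totals = defaultdict(int)
    for segment, amount in rows:
        totals[segment] += amount
    return dict(totals)


finance = revenue_by_segment(finance_rows)
sales = revenue_by_segment(sales_rows)

# The shortfall in the Sales view of Enterprise revenue.
gap = finance["Enterprise"] - sales["Enterprise"]
```

Both teams hold the same underlying revenue, yet their Enterprise totals differ by the full value of the deal filed under the variant label.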

HOW TO USE THIS

Read sequentially to see the full taxonomy. Jump to a specific entry by anchor when citing or referring. Use the "produced by" chips to walk upstream from a visible problem to its causes; use the "produces" chips to walk downstream from a suspected root cause to its consequences. Every chip is a deep link to the referenced entry.

The same causal graph powers Trace the Chain, which presents the catalogue as an interactive walk rather than a reference.
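
Walking the chain in either direction amounts to a graph traversal. The sketch below is one possible representation, not the catalogue's actual data: the edges in `PRODUCES` are illustrative assumptions chosen only to show the mechanics, and they use entry identifiers purely as labels.

```python
from collections import deque

# Illustrative "produces" edges only; the real edges live on each catalogue entry.
PRODUCES = {
    "L3.7": ["L2.1", "L2.5"],
    "L2.1": ["L1.1"],
    "L2.5": ["L1.5"],
    "L1.1": ["L4.1"],
    "L1.5": ["L4.1"],
}


def invert(edges):
    """Derive the "produced by" direction from the "produces" direction."""
    inverted = {}
    for source, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


PRODUCED_BY = invert(PRODUCES)


def walk(start, edges):
    """Return every entry reachable from `start`, breadth-first, excluding `start`."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


upstream = walk("L4.1", PRODUCED_BY)   # from a symptom towards its causes
downstream = walk("L3.7", PRODUCES)    # from a suspected root cause to its effects
```

Storing the edges in one direction and deriving the other keeps the two views consistent, which is what lets the chain read the same way upstream and downstream.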