The dimensional data problem.

Four layers. 34 named failure modes. Citable, deep-linked, and directly connected to the interactive diagnostic.

LAYERS 4 | ENTRIES 34 | LICENSE CC BY 4.0

This is the reference catalogue for the dimensional data problem. It is not an essay. It enumerates the specific failure modes observed in practice across four causally-connected layers: the variance in the data, the processes that produce it, the organisational conditions that allow it to persist, and the downstream business consequences.

Each entry is independently anchorable. The URL pattern is /open-source/the-problem#l1-entry-slug. Practitioners cite entries by anchor in retrospectives, RFCs, and incident reviews. Each entry carries explicit causal edges to adjacent layers: the upstream causes that produce it and the downstream effects it produces in turn. The chain reads in either direction.
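
As a rough illustration of the URL pattern, the sketch below builds a deep link from a layer number and an entry title. The slug rule (lowercase, runs of non-alphanumeric characters collapsed to a hyphen) and the helper name `entry_anchor` are assumptions made for this example; the catalogue only documents the pattern itself, so check a generated anchor against the live page before citing it.

```python
import re

def entry_anchor(layer: int, title: str) -> str:
    """Build a deep link for an entry (slug rule is an assumption, not documented)."""
    # Lowercase the title and collapse anything that is not a-z or 0-9 into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"/open-source/the-problem#l{layer}-{slug}"

print(entry_anchor(1, "Same concept, different string"))
```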

The interactive companion to this catalogue is Trace the Chain, a diagnostic that walks the same four layers as a guided conversation. The catalogue is the reference; the diagnostic is the tool.

LAYER 1 | 8 entries

Data layer.

The variance in the data itself.

The forms dimensional variance takes once it exists. Each entry is a specific shape of inconsistency visible in the values stored in the data, independent of how the inconsistency got there. These are what a practitioner sees when they open a table and look at a column.

L1.1

Same concept, different string

The same entity represented with different casing, abbreviation, spacing, or encoding. Each value means the same thing, but they don't match on a join because the characters differ.

PRODUCED BY

L2 No validation on data entry; L2 The tool won't allow correct entry; L2 Data degrades every time it moves between systems; L2 The schema itself invites bad data; L2 Manual processing steps introduce new variance; L2 The people entering data don't know the standard exists

PRODUCES

L4 Reports don't match across teams; L4 Analysts spend days on cleanup before analysis; L4 Cross-system joins fail silently; L4 Self-service dashboards are abandoned
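
A minimal sketch of a join key that absorbs the casing, spacing, and encoding differences described in this entry. The vendor strings and the helper name `normalise_key` are illustrative assumptions. Abbreviation differences (for example Corp versus Corporation) are not string-normalisable and are deliberately left out.

```python
import unicodedata

def normalise_key(value: str) -> str:
    """Return a comparison key that ignores casing, spacing, and compatibility encodings."""
    # NFKC folds compatibility characters (e.g. a non-breaking space) into plain forms.
    text = unicodedata.normalize("NFKC", value)
    # casefold() is a stronger lower(); split/join collapses runs of whitespace.
    return " ".join(text.casefold().split())

# Hypothetical variants of one vendor name.
variants = ["Acme Corp", "ACME  CORP", " acme corp ", "Acme\u00a0Corp"]
keys = {normalise_key(v) for v in variants}
print(keys)
```

Keeping the original string alongside the key, rather than overwriting it, preserves the evidence of where each variant came from.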

L1.2

Different labels for the same real-world entity

Two strings that are both valid but refer to the same thing. Unlike formatting differences, resolving these requires domain knowledge, because no string operation can determine that Laptop and Notebook are the same product.

PRODUCED BY

L2 No validation on data entry; L2 Data degrades every time it moves between systems; L2 The schema itself invites bad data; L2 Manual processing steps introduce new variance; L2 Automation and AI produce errors at scale; L2 Nobody owns the dimension

PRODUCES

L4 Reports don't match across teams; L4 Analysts spend days on cleanup before analysis; L4 Cross-system joins fail silently; L4 Vendor spend is fragmented and invisible; L4 Self-service dashboards are abandoned; L4 ML models underperform for unclear reasons
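
Because no string operation can discover that Laptop and Notebook are the same product, the equivalence has to be recorded somewhere. The sketch below assumes a small, human-curated alias table (`CANONICAL_PRODUCT` and `resolve` are hypothetical names) and returns `None` for anything it has not been told about, so unknown labels are routed to review rather than guessed.

```python
from typing import Optional

# Curated by someone with domain knowledge; contents here are illustrative only.
CANONICAL_PRODUCT = {
    "laptop": "Laptop",
    "notebook": "Laptop",
    "tablet": "Tablet",
}

def resolve(label: str) -> Optional[str]:
    """Map a label to its canonical product, or None if it has never been reviewed."""
    return CANONICAL_PRODUCT.get(label.strip().casefold())

print(resolve("Notebook"))   # known alias
print(resolve("Ultrabook"))  # unknown label, needs a human decision
```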

L1.3

Multiple concepts crammed into one field

A single field encodes several pieces of information: country, segment, year, and quarter concatenated into one string. The data can't be filtered or aggregated by any individual component until it's decomposed.

PRODUCED BY

L2 The tool won't allow correct entry; L2 The schema itself invites bad data; L2 The people entering data don't know the standard exists

PRODUCES

L4 Analysts spend days on cleanup before analysis; L4 Cross-system joins fail silently; L4 Strategic cross-cutting analysis is blocked
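
A sketch of decomposing a concatenated field, assuming a hypothetical format such as `UK-ENT-2023-Q4` (country, segment, year, quarter separated by hyphens). Real fields rarely follow a single format, so the function returns `None` for anything that does not match rather than guessing.

```python
import re
from typing import NamedTuple, Optional

class Parts(NamedTuple):
    country: str
    segment: str
    year: int
    quarter: int

# Assumed layout: two-letter country, segment code, four-digit year, Q1..Q4.
PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})-(?P<segment>[A-Z]+)-(?P<year>\d{4})-Q(?P<quarter>[1-4])$"
)

def decompose(value: str) -> Optional[Parts]:
    """Split the combined code into separate, filterable components."""
    match = PATTERN.match(value.strip().upper())
    if match is None:
        return None  # leave non-conforming values for manual review
    return Parts(match["country"], match["segment"],
                 int(match["year"]), int(match["quarter"]))

print(decompose("UK-ENT-2023-Q4"))
print(decompose("UK ENT 2023"))
```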

L1.4

Same dimension captured at different levels of detail

One system records London, another records United Kingdom, another records EMEA. Each is correct in its own context, but they represent different levels of the same hierarchy and can't be joined without an explicit mapping.

PRODUCED BY

L2 Data degrades every time it moves between systems; L2 The schema itself invites bad data

PRODUCES

L4 Reports don't match across teams; L4 Cross-system joins fail silently; L4 Strategic cross-cutting analysis is blocked
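
The explicit mapping this entry calls for can be sketched as a parent lookup plus a level label, using the London, United Kingdom, EMEA example from the text. The dictionary names and the `roll_up` helper are assumptions for illustration. Values can be rolled up to a coarser level, but never down to a finer one.

```python
from typing import Optional

# Each value points at its parent in the hierarchy; None marks the top.
PARENT = {"London": "United Kingdom", "United Kingdom": "EMEA", "EMEA": None}
LEVEL = {"London": "city", "United Kingdom": "country", "EMEA": "region"}

def roll_up(value: str, target_level: str) -> Optional[str]:
    """Walk up the hierarchy until a node at target_level is found."""
    node: Optional[str] = value
    while node is not None:
        if LEVEL.get(node) == target_level:
            return node
        node = PARENT.get(node)
    return None  # unknown value, or target is finer than the value's own level

print(roll_up("London", "region"))
print(roll_up("EMEA", "city"))
```

Joining three systems then means rolling every value up to the coarsest level any of them records.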

L1.5

Correct when entered, but the world changed since

A department was renamed. A product was discontinued. A vendor rebranded. The old labels persist in historical records. Any query using the current name silently drops all records tagged with the old one.

PRODUCED BY

L2 The schema itself invites bad data; L2 The world changes but the data doesn't follow; L2 Automation and AI produce errors at scale

PRODUCES

L4 Reports don't match across teams; L4 Strategic cross-cutting analysis is blocked; L4 Compliance can't produce defensible numbers
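
A sketch of how a query by current name silently drops history, and how a rename table recovers it. The Facebook to Meta rename is the example used elsewhere in this catalogue; the record amounts and the names `RENAMES` and `names_for` are invented for illustration.

```python
# old name -> new name; chains of renames are followed transitively.
RENAMES = {"Facebook": "Meta"}

def names_for(current: str) -> set:
    """Return the current name plus every historical name that led to it."""
    names = {current}
    changed = True
    while changed:
        changed = False
        for old, new in RENAMES.items():
            if new in names and old not in names:
                names.add(old)
                changed = True
    return names

# Hypothetical spend records tagged with whatever name was current at the time.
records = [("Facebook", 120), ("Facebook", 80), ("Meta", 50)]

naive_total = sum(amount for name, amount in records if name == "Meta")
full_total = sum(amount for name, amount in records if name in names_for("Meta"))
print(naive_total, full_total)
```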

L1.6

Missing data disguised as real values

The field isn't empty, but the value carries no information: N/A, TBD, Uncategorised, or a system default. These pass completeness checks, inflate distinct value counts, and hide the fact that the data was never actually captured.

PRODUCED BY

L2 No validation on data entry; L2 The tool won't allow correct entry; L2 Manual processing steps introduce new variance

PRODUCES

L4 Analysts spend days on cleanup before analysis; L4 ML models underperform for unclear reasons
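
The gap between a naive completeness check and the information actually captured can be sketched as follows. The placeholder set contains only the values named in this entry; any real list would need to be extended per system, and the sample column is invented.

```python
# Values that fill the field but carry no information (from the entry above).
PLACEHOLDERS = {"", "n/a", "tbd", "uncategorised"}

def is_informative(value) -> bool:
    """True only if the value is present and is not a known placeholder."""
    return value is not None and value.strip().casefold() not in PLACEHOLDERS

# Hypothetical column values.
values = ["Retail", "N/A", "TBD", "Finance", "Uncategorised", "Retail"]

naive_completeness = sum(1 for v in values if v) / len(values)
effective_completeness = sum(is_informative(v) for v in values) / len(values)
distinct_raw = len(set(values))
distinct_real = len({v for v in values if is_informative(v)})

print(naive_completeness, effective_completeness, distinct_raw, distinct_real)
```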

L1.7

Clean, valid value assigned to the wrong category

The value exists in the picklist. The field is populated. It passes every mechanical quality check. But someone tagged the record with the wrong category. Only a person with domain context will catch it.

PRODUCED BY

L2 No validation on data entry; L2 The tool won't allow correct entry; L2 The world changes but the data doesn't follow; L2 Manual processing steps introduce new variance; L2 Automation and AI produce errors at scale; L2 Nobody owns the dimension; L2 The people entering data don't know the standard exists

PRODUCES

L4 Reports don't match across teams; L4 Compliance can't produce defensible numbers; L4 The organisation has stopped trusting its data; L4 ML models underperform for unclear reasons

L1.8

Same label means different things in different contexts

Three departments use the word Enterprise to mean three different things. Two systems use the same code (Department code 100) for different concepts. A merger leaves two products called Platform in one combined namespace. The label is identical, the meanings are not, and the collision is invisible until someone joins the data.

PRODUCED BY

L2 Data degrades every time it moves between systems; L2 The schema itself invites bad data; L2 The world changes but the data doesn't follow; L2 Nobody owns the dimension

PRODUCES

L4 Reports don't match across teams; L4 Strategic cross-cutting analysis is blocked; L4 The organisation has stopped trusting its data

LAYER 2 | 9 entries

Process layer.

How the variance is produced and propagates.

The processes that produce the variance in Layer 1 or fail to prevent it. Each entry is a specific mechanism by which inconsistency enters the data, moves between systems, or accumulates over time. Each process failure produces one or more data-layer outcomes.

L2.1

No validation on data entry

The field is a text box with no dropdown, no picklist, and no enforcement against existing values. Every person who enters data types their own format, abbreviation, and naming preference. The single largest source of dimensional variance by volume.

PRODUCED BY

L3 Incentives reward speed over accuracy; L3 The organisation grew faster than its data management; L3 Data debt is known, acknowledged, and perpetually deferred

PRODUCES

L1 Same concept, different string; L1 Different labels for the same real-world entity; L1 Missing data disguised as real values; L1 Clean, valid value assigned to the wrong category

L2.2

The tool won't allow correct entry

A field limited to 30 characters forces abbreviation. No multi-select support forces people to cram values together. Platform defaults like Uncategorised fill the field automatically when the user skips it.

PRODUCED BY

L3 Data infrastructure is chronically underfunded; L3 Data debt is known, acknowledged, and perpetually deferred

PRODUCES

L1 Same concept, different string; L1 Multiple concepts crammed into one field; L1 Missing data disguised as real values; L1 Clean, valid value assigned to the wrong category

L2.3

Data degrades every time it moves between systems

When data flows through ETL pipelines, APIs, or migrations, encoding changes, fields get dropped, and schema mappings don't align on values. The source data may be clean. The problem happens in the space between systems.

PRODUCED BY

L3 Departments operate as data silos; L3 Data infrastructure is chronically underfunded; L3 The organisation grew faster than its data management

PRODUCES

L1 Same concept, different string; L1 Different labels for the same real-world entity; L1 Same dimension captured at different levels of detail; L1 Same label means different things in different contexts

L2.4

The schema itself invites bad data

No canonical reference table means any string is accepted as valid. No hierarchy constraints mean impossible combinations are permitted. No versioning means renames destroy history. The data model removes the guardrails.

PRODUCED BY

L3 No executive sponsors data quality; L3 The cost of bad data is invisible; L3 Data infrastructure is chronically underfunded

PRODUCES

L1 Same concept, different string; L1 Different labels for the same real-world entity; L1 Multiple concepts crammed into one field; L1 Same dimension captured at different levels of detail; L1 Correct when entered, but the world changed since; L1 Same label means different things in different contexts
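
One of the guardrails named in this entry, the canonical reference table, can be sketched with SQLite from the Python standard library. The table and column names (`vendor`, `invoice`, `vendor_name`) are hypothetical. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

# The reference table defines which strings count as valid vendors.
conn.execute("CREATE TABLE vendor (name TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE invoice ("
    " id INTEGER PRIMARY KEY,"
    " vendor_name TEXT NOT NULL REFERENCES vendor(name),"
    " amount REAL)"
)
conn.execute("INSERT INTO vendor VALUES ('Acme Corp')")
conn.execute("INSERT INTO invoice (vendor_name, amount) VALUES ('Acme Corp', 100.0)")

try:
    # A free-text variant is no longer silently accepted.
    conn.execute("INSERT INTO invoice (vendor_name, amount) VALUES ('ACME corp.', 50.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

This covers only the first of the three gaps the entry lists; hierarchy constraints and versioned renames need additional structure.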

L2.5

The world changes but the data doesn't follow

A vendor rebrands from Facebook to Meta. The ERP is updated, but the CRM, expense tool, and contract system still carry the old name. No process exists to cascade corrections across systems.

PRODUCED BY

L3 No executive sponsors data quality; L3 Workarounds are accepted as the process; L3 The organisation grew faster than its data management

PRODUCES

L1 Correct when entered, but the world changed since; L1 Clean, valid value assigned to the wrong category; L1 Same label means different things in different contexts

L2.6

Manual processing steps introduce new variance

Data passes through a spreadsheet, a copy-paste step, or a manually maintained mapping table. Three people maintain the mapping. Each picks a different canonical form. The tool meant to resolve variance becomes a new source of it.

PRODUCED BY

L3 Workarounds are accepted as the process; L3 Data infrastructure is chronically underfunded

PRODUCES

L1 Same concept, different string; L1 Different labels for the same real-world entity; L1 Missing data disguised as real values; L1 Clean, valid value assigned to the wrong category

L2.7

Automation and AI produce errors at scale

A rule-based classifier misclassifies edge cases systematically. An ML model trained on outdated labels produces confident, wrong output. Unlike human errors, which tend to be random, automation errors are systematic and affect entire populations.

PRODUCED BY

L3 No executive sponsors data quality; L3 Data infrastructure is chronically underfunded; L3 The organisation grew faster than its data management; L3 Data debt is known, acknowledged, and perpetually deferred

PRODUCES

L1 Different labels for the same real-world entity; L1 Correct when entered, but the world changed since; L1 Clean, valid value assigned to the wrong category

L2.8

Nobody owns the dimension

No individual or team is accountable for the quality of any dimension field. When a problem is found, there's no one to report it to and no one whose job it is to fix it. The dimension exists in a governance vacuum.

PRODUCED BY

L3 No executive sponsors data quality; L3 Incentives reward speed over accuracy; L3 Departments operate as data silos; L3 Workarounds are accepted as the process; L3 The organisation grew faster than its data management; L3 Data debt is known, acknowledged, and perpetually deferred

PRODUCES

L1 Different labels for the same real-world entity; L1 Clean, valid value assigned to the wrong category; L1 Same label means different things in different contexts

L2.9

The people entering data don't know the standard exists

A canonical naming convention was defined and published in a governance wiki. But the sales reps, support agents, and marketing managers who actually enter data were never told about it.

PRODUCED BY

L3 Incentives reward speed over accuracy; L3 The organisation grew faster than its data management

PRODUCES

L1 Same concept, different string; L1 Multiple concepts crammed into one field; L1 Clean, valid value assigned to the wrong category

LAYER 3 | 8 entries

Organisational layer.

Why the process failures persist.

The structural conditions in the organisation that allow Layer 2 process failures to persist. Each entry is a specific organisational pattern that lets the dimensional data problem continue. These are the root causes from which the rest of the chain follows.

L3.1

No executive sponsors data quality

Schema redesigns, change management processes, and governance staffing all require sustained funding. Without an executive sponsor, these initiatives compete against product features and infrastructure projects, and lose every time.

PRODUCES

L2 The schema itself invites bad data; L2 The world changes but the data doesn't follow; L2 Automation and AI produce errors at scale; L2 Nobody owns the dimension

L3.2

Incentives reward speed over accuracy

Sales reps are measured on deals closed, not CRM hygiene. Support agents are measured on tickets per hour, not categorisation accuracy. Taking 30 extra seconds to select the correct value actively works against the KPI.

PRODUCES

L2 No validation on data entry; L2 Nobody owns the dimension; L2 The people entering data don't know the standard exists

L3.3

The cost of bad data is invisible

An analyst spends three days reconciling vendor names before running a report. That time is absorbed into the project timeline. It's never isolated, measured, or reported as waste. Without a number, there's no business case.

PRODUCES

L2 The schema itself invites bad data

L3.4

Departments operate as data silos

Sales configures Salesforce with their own picklists. Finance configures the ERP with a different scheme. Marketing sets up HubSpot independently. No shared data model exists across functions. Cross-functional data requests are treated as favours.

PRODUCES

L2 Data degrades every time it moves between systems; L2 Nobody owns the dimension

L3.5

Workarounds are accepted as the process

Every analyst has their own CASE WHEN logic and personal mapping spreadsheet. Finance manually adjusts the vendor report before every board meeting. These have been in place so long they are the process. Nobody questions them.

PRODUCES

L2 The world changes but the data doesn't follow; L2 Manual processing steps introduce new variance; L2 Nobody owns the dimension

L3.6

Data infrastructure is chronically underfunded

The organisation invests heavily in CRM, ERP, and BI tools. It invests nothing in the connective tissue between them: canonical registries, semantic layers, data quality monitoring. The plumbing is invisible, so it's neglected.

PRODUCES

L2 The tool won't allow correct entry; L2 Data degrades every time it moves between systems; L2 The schema itself invites bad data; L2 Manual processing steps introduce new variance; L2 Automation and AI produce errors at scale

L3.7

The organisation grew faster than its data management

New products were added without updating taxonomies. New geographies onboarded with local conventions. Acquisitions brought parallel dimensional structures. The data team added one person. The system count quadrupled.

PRODUCES

L2 No validation on data entry; L2 Data degrades every time it moves between systems; L2 The world changes but the data doesn't follow; L2 Automation and AI produce errors at scale; L2 Nobody owns the dimension; L2 The people entering data don't know the standard exists

L3.8

Data debt is known, acknowledged, and perpetually deferred

The dimensional data problems are in the retro notes. They're mentioned in quarterly reviews. Everyone agrees they should be fixed. They never reach the top of the priority list because they don't block a release and don't have a quantified impact.

PRODUCES

L2 No validation on data entry; L2 The tool won't allow correct entry; L2 Automation and AI produce errors at scale; L2 Nobody owns the dimension

LAYER 4 | 9 entries

Downstream layer.

The business consequences felt by consumers.

The business consequences that follow from the data-layer variance. Each entry is a specific category of harm felt by consumers of the dimensional data: analysts, models, decision-makers, regulators, and customers. The catalogue starts here for a reader walking from symptom to cause.

L4.1

Reports don't match across teams

Finance and Sales pull the same revenue metric from different systems and get different totals. The gap isn't analytical. It comes from different segment labels, vendor names, or product categories in each source.

PRODUCED BY

L1 Same concept, different string; L1 Different labels for the same real-world entity; L1 Same dimension captured at different levels of detail; L1 Correct when entered, but the world changed since; L1 Clean, valid value assigned to the wrong category; L1 Same label means different things in different contexts

L4.2

Cross-system joins fail silently

A query joining CRM to ERP drops records because the same entity carries different names in each system. No error is thrown. The output looks complete, but every mismatched label is a silently missing row.

PRODUCED BY

L1 Same concept, different string; L1 Different labels for the same real-world entity; L1 Multiple concepts crammed into one field; L1 Same dimension captured at different levels of detail
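
The silent drop can be reproduced in a few lines with SQLite from the Python standard library. The tables, account names, and figures below are invented for illustration. The inner join raises no error and simply returns fewer rows; a left join filtered on the missing side is one way to make the loss visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crm (account TEXT, owner TEXT);
CREATE TABLE erp (account TEXT, revenue INTEGER);
INSERT INTO crm VALUES ('Acme Corp', 'Dana'), ('Globex', 'Lee');
INSERT INTO erp VALUES ('ACME Corporation', 500), ('Globex', 300);
""")

# Inner join: no error, but the mismatched account is simply absent.
joined = conn.execute(
    "SELECT crm.account, erp.revenue FROM crm "
    "JOIN erp ON crm.account = erp.account"
).fetchall()

# Anti-join: list CRM accounts that found no partner in the ERP.
unmatched = conn.execute(
    "SELECT crm.account FROM crm "
    "LEFT JOIN erp ON crm.account = erp.account "
    "WHERE erp.account IS NULL"
).fetchall()

print(joined)
print(unmatched)
```

Running the anti-join alongside every cross-system join turns a silent loss into a countable one.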

L4.3

Analysts spend days on cleanup before analysis

Analysts spend 30-50% of their time reconciling dimension values before they can start actual analysis. The same CASE WHEN statements get written independently by different people for different reports, every cycle.

PRODUCED BY

L1 Same concept, different string; L1 Different labels for the same real-world entity; L1 Multiple concepts crammed into one field; L1 Missing data disguised as real values

L4.4

Vendor spend is fragmented and invisible

The same supplier appears under three different names across procurement, finance, and expense systems. Each entry shows $600K. None crosses the threshold for strategic review. Consolidated, the $1.8M total qualifies for volume discounts.

PRODUCED BY

L1 Different labels for the same real-world entity
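
The arithmetic in this entry can be made concrete. The $600K figures come from the text; the supplier names, the alias table, and the $1M review threshold are assumptions chosen only so the example runs.

```python
from collections import defaultdict

REVIEW_THRESHOLD = 1_000_000  # hypothetical threshold for strategic review

# Hypothetical alias table mapping each variant to one canonical supplier.
ALIASES = {
    "Acme Corp": "Acme Corp",
    "ACME Corporation": "Acme Corp",
    "Acme Co.": "Acme Corp",
}

# (source system, supplier name as recorded, annual spend)
spend_rows = [
    ("procurement", "Acme Corp", 600_000),
    ("finance", "ACME Corporation", 600_000),
    ("expenses", "Acme Co.", 600_000),
]

# Evaluated row by row, nothing crosses the threshold.
flagged_raw = [name for _, name, amount in spend_rows if amount >= REVIEW_THRESHOLD]

# Consolidated under the canonical name, the same spend does.
totals = defaultdict(int)
for _, name, amount in spend_rows:
    totals[ALIASES.get(name, name)] += amount
flagged_consolidated = {n: t for n, t in totals.items() if t >= REVIEW_THRESHOLD}

print(flagged_raw)
print(flagged_consolidated)
```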

L4.5

Compliance can't produce defensible numbers

A regulator asks for a count of active contracts by type. The answer depends on whether MSA, Master Service Agreement, and Master Services Agreement count as the same type. The organisation can't produce a defensible number.

PRODUCED BY

L1 Correct when entered, but the world changed since; L1 Clean, valid value assigned to the wrong category

L4.6

ML models underperform for unclear reasons

A churn prediction model uses industry as a feature. The field contains 47 label variants of 10 real industries. The model treats each variant as a distinct category, diluting the predictive signal. The team blames the algorithm.

PRODUCED BY

L1 Different labels for the same real-world entity; L1 Missing data disguised as real values; L1 Clean, valid value assigned to the wrong category

L4.7

Self-service dashboards are abandoned

Filter dropdowns show 47 vendor name variants instead of 10 clean entries. Drill-downs produce unexpected results. Users lose trust in the dashboard and email the analyst directly, defeating the purpose of self-service.

PRODUCED BY

L1 Same concept, different string; L1 Different labels for the same real-world entity

L4.8

Strategic cross-cutting analysis is blocked

Trend analysis breaks at every rename boundary. Cross-functional analysis (like correlating support volume with churn by segment) is blocked because segment definitions differ across the systems involved.

PRODUCED BY

L1 Multiple concepts crammed into one field; L1 Same dimension captured at different levels of detail; L1 Correct when entered, but the world changed since; L1 Same label means different things in different contexts

L4.9

The organisation has stopped trusting its data

Stakeholders default to intuition and personal spreadsheets because past reports were wrong. The data team loses influence. New data initiatives face scepticism. Trust recovery takes longer than the remediation itself.

PRODUCED BY

L1 Clean, valid value assigned to the wrong category; L1 Same label means different things in different contexts

HOW TO USE THIS

Read sequentially to see the full taxonomy. Jump to a specific entry by anchor when citing or referring. Use the PRODUCED BY chips to walk upstream from a visible problem to its causes; use the PRODUCES chips to walk downstream from a suspected root cause to its consequences. Every chip is a deep link to the referenced entry.

The same causal graph powers Trace the Chain, which presents the catalogue as an interactive walk rather than a reference.
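
For readers who want to walk the chain programmatically, the sketch below encodes a small subset of the catalogue's PRODUCES edges (from entries L3.3, L2.4, L1.3, and L1.4) and traverses them in both directions. The data structure and function names are assumptions made for this example and do not describe how Trace the Chain is implemented.

```python
from collections import deque

# Subset of PRODUCES edges, keyed by entry number.
PRODUCES = {
    "L3.3": ["L2.4"],
    "L2.4": ["L1.1", "L1.2", "L1.3", "L1.4", "L1.5", "L1.8"],
    "L1.3": ["L4.2", "L4.3", "L4.8"],
    "L1.4": ["L4.1", "L4.2", "L4.8"],
}

def invert(edges):
    """Derive PRODUCED BY edges by reversing every PRODUCES edge."""
    reverse = {}
    for source, targets in edges.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return reverse

def walk(start, edges):
    """Breadth-first walk returning every entry reachable from start."""
    seen, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen

PRODUCED_BY = invert(PRODUCES)

print(walk("L4.8", PRODUCED_BY))  # upstream: symptom to causes
print(walk("L3.3", PRODUCES))     # downstream: root cause to consequences
```

Storing only one direction and deriving the other keeps the two chip sets from drifting apart.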