The dimensional data problem.
Four layers. 34 named failure modes. Citable, deep-linked, and directly connected to the interactive diagnostic.
This is the reference catalogue for the dimensional data problem. It is not an essay. It enumerates the specific failure modes observed in practice across four causally connected layers: the variance in the data, the processes that produce it, the organisational conditions that allow it to persist, and the downstream business consequences.
Each entry is independently anchorable. The URL pattern is /open-source/the-problem#l1-entry-slug. Practitioners cite entries by anchor in retrospectives, RFCs, and incident reviews. Each entry carries explicit causal edges to adjacent layers: the upstream causes that produce it and the downstream effects it produces in turn. The chain reads in either direction.
The interactive companion to this catalogue is Trace the Chain, a diagnostic that walks the same four layers as a guided conversation. The catalogue is the reference; the diagnostic is the tool.
Data layer.
The variance in the data itself.
The forms dimensional variance takes once it exists. Each entry is a specific shape of inconsistency visible in the values stored in the data, independent of how the inconsistency got there. These are what a practitioner sees when they open a table and look at a column.
Same concept, different string
The same entity represented with different casing, abbreviation, spacing, or encoding. Each value means the same thing, but they don't match on a join because the characters differ.
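A minimal sketch of how casing, spacing, and encoding differences can be reduced to a single comparison key. The function name `normalise_label` and the sample vendor strings are illustrative assumptions, not part of the catalogue. Abbreviation differences (Corp versus Corporation) are deliberately not handled here, because they need a lookup rather than a string operation.

```python
import re
import unicodedata


def normalise_label(value: str) -> str:
    """Build a comparison key from a label; the stored value is not modified."""
    # NFKC folds compatibility characters, e.g. a non-breaking space becomes a plain space.
    text = unicodedata.normalize("NFKC", value)
    # casefold() is a stronger lower() intended for caseless matching.
    text = text.casefold()
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical variants of one vendor name (casing, spacing, encoding).
variants = ["Acme Corp", "ACME CORP", "  acme   corp ", "Acme\u00a0Corp"]
keys = {normalise_label(v) for v in variants}
print(keys)
```

Joining on the key rather than the raw string lets the four variants match while the original values stay available for audit.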
Different labels for the same real-world entity
Two different strings, both valid, that refer to the same thing. Unlike formatting differences, these can only be resolved with domain knowledge, because no string operation can determine that Laptop and Notebook are the same product.
Multiple concepts crammed into one field
A single field encodes several pieces of information, for example country, segment, year, and quarter concatenated into one string. The data can't be filtered or aggregated by any individual component until it's decomposed.
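As a sketch of what decomposition can look like, the example below assumes a hypothetical layout of `COUNTRY-SEGMENT-YEAR-Qn` with a hyphen separator. The layout, the `SegmentKey` type, and the sample code `UK-ENT-2024-Q3` are assumptions for illustration; real composite fields rarely follow one layout consistently, which is why the parser rejects anything it does not recognise instead of guessing.

```python
from typing import NamedTuple


class SegmentKey(NamedTuple):
    country: str
    segment: str
    year: int
    quarter: int


def decompose(code: str, sep: str = "-") -> SegmentKey:
    """Split an assumed COUNTRY-SEGMENT-YEAR-Qn code into typed components."""
    parts = code.strip().split(sep)
    if len(parts) != 4:
        raise ValueError(f"expected 4 components, got {len(parts)}: {code!r}")
    country, segment, year, quarter = parts
    if not (quarter[:1].upper() == "Q" and quarter[1:].isdigit()):
        raise ValueError(f"unrecognised quarter component: {quarter!r}")
    return SegmentKey(country, segment, int(year), int(quarter[1:]))


key = decompose("UK-ENT-2024-Q3")
print(key.country, key.year, key.quarter)
```

Once the components are separate fields, each can be filtered or aggregated on its own.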
Same dimension captured at different levels of detail
One system records London, another records United Kingdom, another records EMEA. Each is correct in its own context, but they represent different levels of the same hierarchy and can't be joined without an explicit mapping.
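The explicit mapping mentioned above can be sketched as a parent lookup plus a level label for each value. The `PARENT` and `LEVEL` tables and the `roll_up` helper are hypothetical names; the three values come from the example in this entry. Values can be rolled up to a coarser level but never down to a finer one, so the join has to happen at the coarsest level shared by all sources.

```python
from typing import Optional

# Hypothetical hierarchy tables: each value points at its parent and carries a level.
PARENT = {"London": "United Kingdom", "United Kingdom": "EMEA"}
LEVEL = {"London": "city", "United Kingdom": "country", "EMEA": "region"}


def roll_up(value: str, target_level: str) -> Optional[str]:
    """Walk up the hierarchy until the target level is reached, else return None."""
    current: Optional[str] = value
    while current is not None and LEVEL.get(current) != target_level:
        current = PARENT.get(current)
    return current


# All three source values agree once they are expressed at the region level.
regions = {roll_up(v, "region") for v in ["London", "United Kingdom", "EMEA"]}
print(regions)
# A coarse value cannot be refined: there is no way to recover a city from EMEA.
print(roll_up("EMEA", "city"))
```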
Correct when entered, but the world changed since
A department was renamed. A product was discontinued. A vendor rebranded. The old labels persist in historical records. Any query using the current name silently drops all records tagged with the old one.
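To make the silent drop concrete, the sketch below compares a query on the current name with one that first resolves old labels through a rename history. The department names, figures, and the `RENAMES` table are invented for illustration only.

```python
# Hypothetical rename history: old label -> the label that replaced it.
RENAMES = {"Account Management": "Customer Success"}


def current_name(label: str) -> str:
    """Follow the rename chain to the latest label, guarding against cycles."""
    seen = set()
    while label in RENAMES and label not in seen:
        seen.add(label)
        label = RENAMES[label]
    return label


# (period, department, headcount) rows spanning the rename.
records = [
    ("2021-Q4", "Account Management", 120),
    ("2023-Q1", "Customer Success", 150),
]

# Filtering on the current name silently skips the pre-rename row.
naive_total = sum(n for _, dept, n in records if dept == "Customer Success")
# Resolving every label first keeps the historical row in scope.
resolved_total = sum(
    n for _, dept, n in records if current_name(dept) == "Customer Success"
)
print(naive_total, resolved_total)
```

Neither query raises an error; only the second one includes the historical record.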
Missing data disguised as real values
The field isn't empty, but the value carries no information: N/A, TBD, Uncategorised, or a system default. These pass completeness checks, inflate distinct value counts, and hide the fact that the data was never actually captured.
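The gap between "populated" and "informative" can be measured directly. The sketch below uses the placeholder values named in this entry; the `PLACEHOLDERS` set and the function names are assumptions, and a real list would need to be built per field.

```python
from typing import Optional, Sequence, Tuple

# Placeholder values taken from the entry above, stored in casefolded form.
PLACEHOLDERS = {"n/a", "tbd", "uncategorised"}


def is_populated(value: Optional[str]) -> bool:
    return value is not None and value.strip() != ""


def is_informative(value: Optional[str]) -> bool:
    return is_populated(value) and value.strip().casefold() not in PLACEHOLDERS


def completeness(values: Sequence[Optional[str]]) -> Tuple[float, float]:
    """Return (populated share, informative share) for a column of values."""
    if not values:
        return 0.0, 0.0
    populated = sum(is_populated(v) for v in values)
    informative = sum(is_informative(v) for v in values)
    return populated / len(values), informative / len(values)


column = ["Retail", "N/A", "TBD", "Finance", "Uncategorised"]
populated_share, informative_share = completeness(column)
print(populated_share, informative_share)
```

A conventional completeness check reports the first number; the second is the one that reflects what was actually captured.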
Clean, valid value assigned to the wrong category
The value exists in the picklist. The field is populated. It passes every mechanical quality check. But someone tagged the record with the wrong category. Only a person with domain context will catch it.
Same label means different things in different contexts
Three departments use the word Enterprise to mean three different things. Two systems use the same code (Department code 100) for different concepts. A merger leaves two products called Platform in one combined namespace. The label is identical, the meanings are not, and the collision is invisible until someone joins the data.
Process layer.
How the variance is produced and propagates.
The processes that produce the variance in Layer 1 or fail to prevent it. Each entry is a specific mechanism by which inconsistency enters the data, moves between systems, or accumulates over time. Each process failure produces one or more data-layer outcomes.
No validation on data entry
The field is a text box with no dropdown, no picklist, and no enforcement against existing values. Every person who enters data types their own format, abbreviation, and naming preference. The single largest source of dimensional variance by volume.
The tool won't allow correct entry
A field limited to 30 characters forces abbreviation. No multi-select support forces people to cram values together. Platform defaults like Uncategorised fill the field automatically when the user skips it.
Data degrades every time it moves between systems
When data flows through ETL pipelines, APIs, or migrations, encoding changes, fields get dropped, and schema mappings don't align on values. The source data may be clean. The problem happens in the space between systems.
The schema itself invites bad data
No canonical reference table means any string is accepted as valid. No hierarchy constraints mean impossible combinations are permitted. No versioning means renames destroy history. The data model removes the guardrails.
The world changes but the data doesn't follow
A vendor rebrands from Facebook to Meta. The ERP is updated, but the CRM, expense tool, and contract system still carry the old name. No process exists to cascade corrections across systems.
Manual processing steps introduce new variance
Data passes through a spreadsheet, a copy-paste step, or a manually maintained mapping table. Three people maintain the mapping. Each picks a different canonical form. The tool meant to resolve variance becomes a new source of it.
Automation and AI produce errors at scale
A rule-based classifier misclassifies edge cases systematically. An ML model trained on outdated labels produces confident, wrong output. Unlike human errors, which are random, automation errors are systematic and affect entire populations.
Nobody owns the dimension
No individual or team is accountable for the quality of any dimension field. When a problem is found, there's no one to report it to and no one whose job it is to fix it. The dimension exists in a governance vacuum.
The people entering data don't know the standard exists
A canonical naming convention was defined and published in a governance wiki. But the sales reps, support agents, and marketing managers who actually enter data were never told about it.
Organisational layer.
Why the process failures persist.
The structural conditions in the organisation that allow Layer 2 process failures to persist. Each entry is a specific organisational pattern that lets the dimensional data problem continue. These are the root causes from which the rest of the chain follows.
No executive sponsors data quality
Schema redesigns, change management processes, and governance staffing all require sustained funding. Without an executive sponsor, these initiatives compete against product features and infrastructure projects, and lose every time.
Incentives reward speed over accuracy
Sales reps are measured on deals closed, not CRM hygiene. Support agents are measured on tickets per hour, not categorisation accuracy. Taking 30 extra seconds to select the correct value actively works against the KPI.
The cost of bad data is invisible
An analyst spends three days reconciling vendor names before running a report. That time is absorbed into the project timeline. It's never isolated, measured, or reported as waste. Without a number, there's no business case.
Departments operate as data silos
Sales configures Salesforce with its own picklists. Finance configures the ERP with a different scheme. Marketing sets up HubSpot independently. No shared data model exists across functions. Cross-functional data requests are treated as favours.
Workarounds are accepted as the process
Every analyst has their own CASE WHEN logic and personal mapping spreadsheet. Finance manually adjusts the vendor report before every board meeting. These have been in place so long they are the process. Nobody questions them.
Data infrastructure is chronically underfunded
The organisation invests heavily in CRM, ERP, and BI tools. It invests nothing in the connective tissue between them: canonical registries, semantic layers, data quality monitoring. The plumbing is invisible, so it's neglected.
The organisation grew faster than its data management
New products were added without updating taxonomies. New geographies were onboarded with local conventions. Acquisitions brought parallel dimensional structures. The data team added one person. The system count quadrupled.
Data debt is known, acknowledged, and perpetually deferred
The dimensional data problems are in the retro notes. They're mentioned in quarterly reviews. Everyone agrees they should be fixed. They never reach the top of the priority list because they don't block a release and don't have a quantified impact.
Downstream layer.
The business consequences felt by consumers.
The business consequences that follow from the data-layer variance. Each entry is a specific category of harm felt by consumers of the dimensional data: analysts, models, decision-makers, regulators, and customers. The catalogue starts here for a reader walking from symptom to cause.
Reports don't match across teams
Finance and Sales pull the same revenue metric from different systems and get different totals. The gap isn't analytical. It comes from different segment labels, vendor names, or product categories in each source.
Cross-system joins fail silently
A query joining CRM to ERP drops records because the same entity carries different names in each system. No error is thrown. The output looks complete, but every mismatched label is a silently missing row.
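One way to make the loss visible is to return the unmatched rows alongside the join result instead of discarding them. The sketch below is a plain-Python inner join with an audit list; the table contents, field names, and the helper `inner_join_with_audit` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical extracts: the first entity carries a different name in each system.
crm = [{"account": "Acme Corp", "arr": 100}, {"account": "Globex", "arr": 50}]
erp = [
    {"customer": "ACME Corporation", "invoiced": 90},
    {"customer": "Globex", "invoiced": 45},
]


def inner_join_with_audit(
    left: List[Row], right: List[Row], left_key: str, right_key: str
) -> Tuple[List[Row], List[Row]]:
    """Inner join that also returns the left rows that found no match."""
    index: Dict[object, List[Row]] = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined: List[Row] = []
    unmatched: List[Row] = []
    for row in left:
        matches = index.get(row[left_key])
        if matches:
            joined.extend({**row, **match} for match in matches)
        else:
            unmatched.append(row)
    return joined, unmatched


joined, unmatched = inner_join_with_audit(crm, erp, "account", "customer")
print(len(joined), [row["account"] for row in unmatched])
```

The joined output alone looks complete; the audit list is what shows that a row went missing.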
Analysts spend days on cleanup before analysis
Analysts spend 30-50% of their time reconciling dimension values before they can start actual analysis. The same CASE WHEN statements get written independently by different people for different reports, every cycle.
Vendor spend is fragmented and invisible
The same supplier appears under three different names across procurement, finance, and expense systems. Each entry shows $600K. None crosses the threshold for strategic review. Consolidated, the $1.8M total qualifies for volume discounts.
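The arithmetic in this entry can be worked through as a sketch. The supplier names, the alias table, and the $1M review threshold are assumptions chosen for illustration; only the three $600K entries and the $1.8M total come from the entry itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

REVIEW_THRESHOLD = 1_000_000  # assumed threshold for strategic review

# Hypothetical alias table mapping each observed name to one canonical supplier.
CANONICAL = {
    "Initech": "Initech Ltd",
    "Initech Ltd.": "Initech Ltd",
    "INITECH LIMITED": "Initech Ltd",
}

# (source system, supplier name as recorded, annual spend in dollars)
spend = [
    ("procurement", "Initech", 600_000),
    ("finance", "Initech Ltd.", 600_000),
    ("expenses", "INITECH LIMITED", 600_000),
]


def totals(
    rows: Iterable[Tuple[str, str, int]], aliases: Optional[Dict[str, str]] = None
) -> Dict[str, int]:
    """Sum spend per supplier, optionally resolving names through an alias table."""
    out: Dict[str, int] = defaultdict(int)
    for _, name, amount in rows:
        key = aliases.get(name, name) if aliases else name
        out[key] += amount
    return dict(out)


raw = totals(spend)
merged = totals(spend, CANONICAL)
flagged_raw = [n for n, t in raw.items() if t >= REVIEW_THRESHOLD]
flagged_merged = [n for n, t in merged.items() if t >= REVIEW_THRESHOLD]
print(flagged_raw, flagged_merged, merged)
```

Under the raw names nothing is flagged; after consolidation the single supplier totals $1.8M and crosses the assumed threshold.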
Compliance can't produce defensible numbers
A regulator asks for a count of active contracts by type. The answer depends on whether MSA, Master Service Agreement, and Master Services Agreement count as the same type. The organisation can't produce a defensible number.
ML models underperform for unclear reasons
A churn prediction model uses industry as a feature. The field contains 47 label variants of 10 real industries. The model treats each variant as a distinct category, diluting the predictive signal. The team blames the algorithm.
Self-service dashboards are abandoned
Filter dropdowns show 47 vendor name variants instead of 10 clean entries. Drill-downs produce unexpected results. Users lose trust in the dashboard and email the analyst directly, defeating the purpose of self-service.
Strategic cross-cutting analysis is blocked
Trend analysis breaks at every rename boundary. Cross-functional analysis (like correlating support volume with churn by segment) is blocked because segment definitions differ across the systems involved.
The organisation has stopped trusting its data
Stakeholders default to intuition and personal spreadsheets because past reports were wrong. The data team loses influence. New data initiatives face scepticism. Trust recovery takes longer than the remediation itself.
Read sequentially to see the full taxonomy. Jump to a specific entry by anchor when citing or referring. Use the "produced by" chips to walk upstream from a visible problem to its causes; use the "produces" chips to walk downstream from a suspected root cause to its consequences. Every chip is a deep link to the referenced entry.
The same causal graph powers Trace the Chain, which presents the catalogue as an interactive walk rather than a reference.
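The upstream and downstream walks can be sketched as a traversal over a directed graph of entries. Everything in the example is hypothetical: the anchor slugs and the edges are invented to show the shape of the walk and do not reproduce the catalogue's actual graph or the implementation behind Trace the Chain.

```python
from collections import deque
from typing import Dict, List

# Hypothetical "produces" edges: cause anchor -> effect anchors.
PRODUCES: Dict[str, List[str]] = {
    "l3-incentives-reward-speed": ["l2-no-validation-on-entry"],
    "l2-no-validation-on-entry": ["l1-same-concept-different-string"],
    "l1-same-concept-different-string": [
        "l4-joins-fail-silently",
        "l4-reports-dont-match",
    ],
}


def invert(edges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Derive the "produced by" edges by reversing every "produces" edge."""
    reverse: Dict[str, List[str]] = {}
    for cause, effects in edges.items():
        for effect in effects:
            reverse.setdefault(effect, []).append(cause)
    return reverse


def walk(start: str, edges: Dict[str, List[str]]) -> List[str]:
    """Breadth-first list of entries reachable from start, excluding start itself."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


PRODUCED_BY = invert(PRODUCES)
upstream = walk("l4-joins-fail-silently", PRODUCED_BY)
downstream = walk("l3-incentives-reward-speed", PRODUCES)
print(upstream)
print(downstream)
```

Storing the edges once and deriving the reverse direction keeps the two walks consistent with each other.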