Why Dimensional Data Gets Messy · Library · CleanDims

This is the origin story. How inconsistent categorical data forms in an organisation, why it persists once formed, and why it is not anyone's fault. The companion primer establishes that dimensional data is a coherent class of data with characteristic failure modes; this article walks through the everyday mechanics by which one organisation, doing reasonable things at every step, ends up with the same field meaning twenty different things across its systems.

The reader who finishes this article should recognise the pattern in their own organisation, understand why no individual decision along the way was wrong, and have a clearer sense of where in the chain a fix would actually take.

Some dimensions have standards. Most do not.

Every dataset an organisation works with combines two kinds of values. Measures are quantitative: revenue, headcount, spend, latency, page views. Dimensions are categorical: region, vendor, role, segment, ticket type. Measures answer “how much”; dimensions answer “what kind.” A revenue figure is a measure; the country, segment, and product it is associated with are dimensions.

The reason dimensional data behaves the way it does in practice has to do with a distinction that most analyses skip past. Some dimensions have external standards. ISO country codes. Currency codes. SIC and NAICS industry codes. UNSPSC product categorisations. For these, a governing body maintains the full set of valid values. A developer building a form can render a dropdown. A pipeline can validate against a known list. The variance is bounded at the moment of capture, because there is a fixed answer to the question “what are the valid values?”

Most dimensions are not like this. Vendor names, job titles, customer segments, support ticket categories, campaign tags, product taxonomies, internal status codes: no governing body exists. Nobody maintains the canonical list of valid vendor names for an organisation, because the canonical list is whatever that organisation chooses to make it. The developer building the form has nothing external to validate against.

When the dimension is not central to anything important, this is tolerable. A text field absorbs whatever the user types and nobody notices. The trouble starts when the same dimension is asked to do real work, becomes a join key between systems, becomes a grouping field in a report, becomes a feature in a model, and the variance that accumulated in the text field starts producing wrong answers downstream.

Without a standard, the developer reaches for a text box.

The developer building the form has two options. Constrain the input to a controlled list, or leave it as a free-text field. For dimensions with external standards, the controlled list is obvious. For dimensions without external standards, the controlled list is impossible to build, because the set of valid values is not known in advance and will grow over time. A dropdown of vendor names that ships on day one will be out of date by week two. The pragmatic choice, repeated across every operational system every developer has ever built, is the text field.

The text field is the right local decision and the wrong global one. It is the right local decision because the alternative, blocking the user from creating a record until a steward has approved the new value, makes the system unusable for its primary purpose. It is the wrong global decision because every text field eventually contains every plausible variant of every value it has ever received.

The variant of this pattern that compounds hardest is the controlled-input-with-an-uncontrolled-escape-hatch: a dropdown of twenty options with “Other” at the bottom and a text field next to it. The “Other” bucket becomes the largest single category within a few years, and the text field beside it accumulates dozens of spellings of the same five real categories, plus genuine new categories that should have been added to the dropdown years ago but never were. This is worse than pure free-text entry, because the dropdown creates a false sense of control. The data quality dashboard counts twenty values and reports the field as healthy. The truth is that the dropdown is governing five percent of the records and the text field is doing the other ninety-five.

Multiple people, multiple conventions.

Where text-field entry exists, variance follows automatically. The mechanism is not individual carelessness. It is the predictable result of multiple people entering the same concept on different days under different conditions.

Two procurement analysts in the same week, both creating records for the same supplier, will enter the supplier name differently. One will type the legal entity, “Amazon Web Services, Inc.” Another will type the common name, “AWS.” A third, on a Friday afternoon, will type “aws” because they are tired. None of them is wrong. The system accepts every string. The accumulation is six variants of the same supplier in the table, none of which match exactly when a query joins on vendor name.

The same person, on different days, will also enter the same concept differently. A support agent classifying tickets at the start of their shift will pick the precise category from the dropdown. The same agent at the end of their shift will pick the closest plausible category, or “Other,” because the cognitive cost of careful classification is not what they are paid to optimise for. Over a year, the data accumulates the average of attentive and inattentive entry.

This is the human-era version of the variance. The volume is bounded by how fast humans can type and by how many humans are involved. Reactive cleanup, the dominant operating mode, works because the volume fits within what a quarterly project can absorb.

Agents do not solve this; they scale it.

The reasonable assumption when introducing automation is that the variance will go away. Agents do not get tired on Friday afternoon. Agents do not have personal abbreviation preferences. Agents reproduce, faithfully, whatever they were configured to produce.

The reasonable assumption is wrong. Agents reproduce, faithfully and at scale, whatever convention they were configured with. If two agents were configured by two different teams with two different conventions, both agents are internally consistent and the variance between them is the new dominant variance. The error profile shifts from messy and visible to clean and structurally wrong.

The surface variance that dominated human-entered data, the casings and the typos and the trailing whitespace, becomes a smaller share of the problem. Definitional variance, where the same label means different things in different parts of the organisation, becomes a larger share. Two finance agents configured by two regional teams with two different definitions of “Capital Expenditure” will write that label into their respective systems thousands of times per week, each internally consistent, each incompatible with the other. The data looks clean. The numbers do not reconcile.

The deeper change agents bring is not to the variance itself. It is to the feedback loop that used to contain it. In the human era, a misclassified record was eventually noticed by a human analyst who thought the dashboard did not make business sense, traced the wrong number back to the record, and corrected it. The feedback loop is the mechanism by which dimensional inconsistency was held in tension rather than allowed to compound. When agents produce most of the records and other agents consume those records, the human in the middle of the loop is no longer there. Errors propagate without the friction that used to slow them.

The accumulating cost, the recurring cleanup.

Whatever the source of the variance, the downstream consequence is the same. Someone, eventually, has to clean it before the data can be used for anything that depends on the categorical surface being consistent. The cleanup is not a one-time job. New data arrives continuously. The variance that was cleaned last quarter has been replaced by new variance this quarter.

The people doing the cleanup cannot easily propagate corrections upstream. Source systems are owned by other teams; changing the values in the source requires permissions, change-management processes, and political capital that the analyst working on this week's report does not have. So the cleanup stays local. It happens in spreadsheets, in CASE WHEN statements buried inside warehouse models, in personal Python scripts on individual laptops. Three departments doing the same cleanup independently arrive at three different versions of canonical, because they did not agree on which version was canonical and they had no shared registry to coordinate through.

Cross-functional reporting is broken again, this time not because the data was inconsistent but because the cleanups disagree. The finance team's vendor list maps AWS to “Amazon Web Services, Inc.” The sales team's vendor list maps AWS to “Amazon.” The analytics team's vendor list maps AWS to “Amazon (AWS).” When the three teams compare numbers, the numbers do not match, and the disagreement is invisible because everyone believes their list is the right one.

Why this is not anyone’s fault.

A reader following the chain so far might be looking for someone to blame. The procurement analyst who typed “aws” instead of “Amazon Web Services, Inc.” The developer who built the form with a text field instead of a dropdown. The data engineer who did not enforce a controlled vocabulary in the warehouse. The analyst whose CASE WHEN statement disagreed with the finance team's spreadsheet.

None of them is at fault, in the sense that none of them made the wrong local decision. The procurement analyst was entering a record under time pressure and the system accepted the string. The developer chose a text field because the alternative would have blocked the user. The data engineer enforced what could be enforced at the time. The analyst built a CASE WHEN because that was the only intervention available at the layer they had access to.

The problem is structural, not personal. It is the predictable consequence of a system in which the categorical surface is owned by no one, governed at the wrong layer, invisible until it breaks, and structurally underinvested in because nobody's career is built by standardising vendor names. The same conditions are present in nearly every organisation. The pattern is universal because the conditions are universal.

What changes the pattern is not better individual behaviour. It is a different operating model: a canonical reference for the dimensions that matter, named stewards with the authority to decide, an intake process for new values, propagation that reaches every system that produces or consumes the dimension, and the runtime infrastructure that makes the reference operationally load-bearing rather than decorative. That model is the subject of the target state. This article is the predecessor: the description of the problem the model exists to solve.