Dimension management manifesto

Dimension management is the discipline of governing the categorical surface of business systems. The CleanDims manifesto argues that dimensional data has been managed reactively for thirty years, that the cost was bounded by humans being the dominant producers of categorical values, and that the shift to agent-produced data is about to make reactive management structurally unsustainable.

I. The four variants.

Open the customer table in any warehouse and look at the cloud provider field. Where the records mean Amazon Web Services, you will find the four variants. AWS. Amazon Web Services. amazon-web-services. The three-character code that an external data vendor stamps on every row it sells you. Same customer, same provider, four labels.

The query that powers the quarterly report joins on the field. The analyst writes a CASE WHEN that collapses three of the four; the fourth is the one the analyst did not know about. The report goes to the CFO with a number that is wrong by some amount. Nobody knows by how much, because nobody knows the fourth variant exists in this table.
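A minimal sketch of the failure, in Python rather than SQL so that it runs on its own. The rows, the spend figures, and the code "AMZ" are invented for illustration; "AMZ" stands in for whatever three-character code the vendor actually uses. The lookup plays the role of the analyst's CASE WHEN, and the second function shows the one check that would have surfaced the fourth variant.

```python
from collections import Counter

# Hypothetical rows. "AMZ" is an invented stand-in for the vendor's code.
rows = [
    {"customer": "c1", "cloud_provider": "AWS", "spend": 100},
    {"customer": "c2", "cloud_provider": "Amazon Web Services", "spend": 200},
    {"customer": "c3", "cloud_provider": "amazon-web-services", "spend": 300},
    {"customer": "c4", "cloud_provider": "AMZ", "spend": 400},
]

# The analyst's CASE WHEN as a lookup: three of the four variants are known.
KNOWN = {
    "AWS": "AWS",
    "Amazon Web Services": "AWS",
    "amazon-web-services": "AWS",
}


def collapse(value):
    # Mirrors an ELSE branch that passes unknown values through unchanged.
    return KNOWN.get(value, value)


def spend_by_provider(rows):
    totals = Counter()
    for row in rows:
        totals[collapse(row["cloud_provider"])] += row["spend"]
    return dict(totals)


def unmapped_values(rows):
    # Values the lookup has never heard of; the report silently splits on these.
    return sorted({r["cloud_provider"] for r in rows if r["cloud_provider"] not in KNOWN})


totals = spend_by_provider(rows)
missed = unmapped_values(rows)
```

In this toy data the AWS total comes out at 600 rather than 1000, and nothing in the report itself says so. Only the separate unmapped check reveals the gap.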

This is the dimensional data problem in its most familiar form, and it has been the dimensional data problem for thirty years.

II. The category nobody named.

Categorical values are the axes along which every business is analysed. Customer segment, vendor name, product category, ticket type, expense code, deal stage. They are the join keys between systems and the features that models train on. They are the columns that determine whether two records describe the same kind of thing or different kinds of thing, and the values inside them are how that determination is made.

The discipline of managing these values has never been named with the seriousness it deserves. Master data management took the name and the budget, but MDM solves a different problem: whether two records refer to the same entity. Data quality monitoring took the next-most-likely name, but DQ tools detect problems after they have entered the data. Catalog tools describe what data assets exist; they do not govern the values that populate them. Governance platforms write the policy under which the values are managed; they do not run the management itself.

What is left is the work of deciding what an accepted value is, recording that decision somewhere that the systems producing the values can read, and keeping the record current as the business changes. This work has, for thirty years, been done as a project rather than as a discipline. Twice a year a team is assigned to clean up the customer segments, or the vendor names, or the product categories, and twice a year the same team produces a deck about how much variance they removed. Six months later the variance has come back, because the systems that produced it are still producing it.

The work has a name now. Dimension management. The practice of treating the categorical surface of every business system as a governed, versioned, operationally load-bearing layer of the infrastructure rather than as a janitorial problem the analytics team handles in its slack hours.

III. Why reactive worked.

For thirty years, dimensional inconsistency was tolerable. Not because the cost was zero, but because the cost was bounded by the speed at which humans could create new variants. A sales representative typing a free-text industry into a CRM produces one variant. Three years of representatives typing produces a few thousand. A few thousand variants can be cleaned up by a well-organised quarterly project, and a well-organised quarterly project fits inside the budget of the analytics team that already exists.

The cost of the variants was visible only when they reached the report, which was after the fact, in a setting where the analyst was paid to clean them up. The cost of the cleanup was a line item that looked like a project and got renewed every year. Nobody had to decide whether the underlying practice was right, because the accounting hid the question. Reactive worked, in the limited sense that the bill came due slowly enough that it could be paid out of existing budgets without a strategic decision.

IV. Why reactive stops working now.

The bound on the rate of variant creation was the speed at which humans can type. That bound is gone.

An agent classifying expense reports produces, in an hour, more dimensional values than a team of analysts produced in a quarter. The agent classifies faithfully against whatever convention it was configured with, which is to say it reproduces, at scale and at machine speed, whichever ambiguity, whichever shorthand, whichever plausible-but-wrong category was in the prompt it was given.

The same approach that has worked, in some sense, for thirty years will not work for the next ten. The accumulation rate has gone up by a factor that no quarterly project can absorb, and the source of the accumulation has become an artifact that learns. The reactive cleanup model assumes that the producer of variants is a human who can be retrained between cleanups. The agent-era producer does not retrain itself between cleanups; it reproduces, at scale, whatever its last instruction said.

Compounding this: agents are also the consumers of dimensional data. A churn model trained on a customer segment field where Financial Services, Banking & Finance, BFSI, and an SIC code all coexist learns to treat the choice of label as a predictor in its own right. The model is not wrong; it is correctly learning that whichever segment label appears is correlated with whichever system produced the record. That signal is noise dressed as feature, and it will be inside the model until the underlying dimensional data is governed.
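The leak can be checked directly before any model is trained. The sketch below assumes a handful of invented records; the source system names are hypothetical and "6021" is only an illustrative SIC-style code. If every label maps to exactly one producing system, the label is a proxy for the system, and a model can learn the system while appearing to learn the segment.

```python
from collections import defaultdict

# Hypothetical (source_system, segment_label) pairs for one real-world segment.
records = [
    ("crm", "Financial Services"),
    ("crm", "Financial Services"),
    ("billing", "Banking & Finance"),
    ("billing", "Banking & Finance"),
    ("support", "BFSI"),
    ("vendor_feed", "6021"),
]


def systems_per_label(records):
    # For each label, collect the set of systems that ever emitted it.
    seen = defaultdict(set)
    for system, label in records:
        seen[label].add(system)
    return {label: sorted(systems) for label, systems in seen.items()}


def label_identifies_source(records):
    # True when every label is produced by exactly one system.
    return all(len(systems) == 1 for systems in systems_per_label(records).values())


by_label = systems_per_label(records)
leaks = label_identifies_source(records)
```

A true result here does not prove a model will exploit the leak, only that the opportunity exists in the data.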

V. What we are proposing.

Dimension management is the practice of governing the categorical surface in production rather than in retrospect. Four commitments follow from this framing, each developed at length in the documents the manifesto introduces.

A canonical reference for every dimension that matters. Not a list in a wiki, not a column in a spreadsheet. A versioned, owned, API-readable record of the accepted values, with aliases, with definitions, with a change log. The reference is operationally load-bearing rather than decorative.

Runtime, not retrospect. Agents and pipelines read from local caches that synchronise continuously to the reference, classify at memory speed, and emit output with a confidence signal that reflects how stale the cache is. The classification happens inside the production flow, at the moment the value is produced.

Stewardship scaled to part-time. Every dimension that matters has a named steward. The steward is one person, not a committee, holding decision rights. The workflow scales down so that holding a stewardship is an addition to a domain expert's role, not a new full-time function.

The discipline, written down. The practice belongs to the category, not to any one product. The manifesto, the primer, the target state, and the four-layer taxonomy of what goes wrong are published. The library is open. The interactive diagnostic is open. The product CleanDims is building is the implementation; the practice is open to everyone who will eventually depend on it.
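The first two commitments can be made concrete with a small sketch. Everything named here is an assumption for illustration: the class names, the field names, the steward, and in particular the linear decay of confidence with cache age. None of it describes the CleanDims library or product; it shows only one shape that a versioned reference and a staleness-aware local cache could take.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CanonicalValue:
    name: str
    definition: str
    aliases: tuple = ()


@dataclass
class DimensionReference:
    dimension: str
    steward: str          # one named person, not a committee
    version: int
    values: list
    change_log: list = field(default_factory=list)


class LocalCache:
    """In-memory copy of a reference, stamped with the time it was synced."""

    def __init__(self, reference, synced_at, max_age_seconds=300.0):
        self.version = reference.version
        self.synced_at = synced_at
        self.max_age_seconds = max_age_seconds
        self._lookup = {}
        for value in reference.values:
            for label in (value.name, *value.aliases):
                self._lookup[label.strip().lower()] = value.name

    def classify(self, raw, now):
        # Assumed rule: confidence decays linearly from 1.0 to 0.0 as the cache ages.
        age = max(0.0, now - self.synced_at)
        freshness = max(0.0, 1.0 - age / self.max_age_seconds)
        canonical = self._lookup.get(raw.strip().lower())
        return {
            "raw": raw,
            "canonical": canonical,
            "reference_version": self.version,
            "confidence": freshness if canonical is not None else 0.0,
        }


reference = DimensionReference(
    dimension="cloud_provider",
    steward="hypothetical.steward@example.com",
    version=3,
    values=[
        CanonicalValue(
            name="AWS",
            definition="Amazon Web Services",
            aliases=("Amazon Web Services", "amazon-web-services"),
        )
    ],
    change_log=["v3: added alias amazon-web-services"],
)

cache = LocalCache(reference, synced_at=0.0, max_age_seconds=100.0)
hit = cache.classify("Amazon Web Services", now=25.0)
miss = cache.classify("Azure", now=25.0)
```

The clock is passed in rather than read inside classify, which keeps the staleness rule easy to test. An unrecognised value returns no canonical name and zero confidence instead of a guess, so that it can be routed to the steward.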

VI. What this is not.

This is not a claim that dimensional data is the most important data in any business. Transactional data is. Customer records are. Financial records are. The dimensional data is the layer those other kinds of data are described and classified and segmented by, which makes it consequential out of proportion to its volume. It is the load-bearing layer that holds the other kinds of data up; when it fails, the failure is felt everywhere downstream, but it is rarely the layer that gets the budget.

This is also not a claim that nothing has worked in the last thirty years. The MDM vendors built a real category. The DQ vendors built a real category. The catalog vendors built a real category. Each of those categories solved a part of the problem, and the parts they solved are still solved. What was left unaddressed is the part that matters most in the agent era: the categorical surface, governed in production, at the speed and scale at which agents operate.

This is also not a claim that anyone has been doing it wrong. The rate of variant creation in a human-only era genuinely was bounded by typing speed. The reactive cleanup model genuinely did work, in the sense that it kept the bill payable. What is changing is the rate of accumulation, and what has to change in response is the model.

VII. What to read next.

The manifesto is one of four foundational documents. The primer establishes the category from first principles for a reader who is encountering dimensional data as a discipline for the first time. The target state specifies what good looks like at the level of the canonical reference, the runtime, the workflow, and the metrics that measure whether the system is working. The four-layer taxonomy catalogues the specific failure modes by layer: data, process, organisational, downstream.

The argument made here is meant to be tested. The library, the diagnostic, and the product surface are how that testing happens in practice.

The categorical surface, governed in production.
