The Variance Evolution Model

Dimensional variance does not appear all at once. It accumulates through phases, and the phases are recognisable. Identifying which phase a given dimension is in is the diagnostic that determines what intervention will work and what will not. This article describes the four phases, the conditions that drive the transition from one to the next, and what each phase looks like in practice.

The model is descriptive rather than predictive. Not every dimension passes through every phase in order; some skip phases, some regress, some stay in one phase indefinitely. What the model offers is a vocabulary for naming where a specific dimension currently sits, and a reasonable indication of what is likely to happen next if conditions do not change.

Phase one: clean by accident.

A dimension in phase one is consistent across systems, but the consistency is not the product of governance. The dimension is young, or it sits in a single system that has not yet been integrated with others, or the volume of records is low enough that one person has been able to maintain consistency by paying attention.

The signature of phase-one dimensions: fewer than a hundred records, fewer than ten distinct values, one team responsible, no formal standard. The consistency holds because the conditions that produce variance have not yet been activated. The dimension is one new system, one new team, one organisational restructure away from leaving phase one. The transition is not gradual; it is the event of a new producer joining the system and bringing their own conventions.
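The signature can be written down as a simple check. This is a minimal sketch: the thresholds come from the paragraph above, while the DimensionProfile class and its field names are hypothetical and would need to be mapped onto whatever profiling data is actually available.

```python
from dataclasses import dataclass


@dataclass
class DimensionProfile:
    """Hypothetical summary of one dimension; field names are illustrative."""
    record_count: int
    distinct_values: int
    producing_teams: int
    has_formal_standard: bool


def matches_phase_one_signature(p: DimensionProfile) -> bool:
    # Thresholds taken from the phase-one signature described in the text.
    return (
        p.record_count < 100
        and p.distinct_values < 10
        and p.producing_teams == 1
        and not p.has_formal_standard
    )


young = DimensionProfile(record_count=42, distinct_values=6,
                         producing_teams=1, has_formal_standard=False)
# The same dimension after a second producer joins: the signature no longer holds.
grown = DimensionProfile(record_count=42, distinct_values=6,
                         producing_teams=2, has_formal_standard=False)
```

A check like this says only that the signature matches today. It cannot say when the second producer will arrive.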

Most practitioners encountering a phase-one dimension assume it will stay clean. It will not, but the moment of departure is unpredictable, which is why phase one is the most deceptive of the four. Organisations that have phase-one dimensions in critical positions are taking on risk they cannot quantify.

Phase two: surface variance.

A dimension in phase two has accumulated the kind of variance that humans introduce when entering text into uncontrolled fields. Casing differences. Abbreviation variants. Trailing whitespace. Misspellings. The values look similar enough that a human eye can recognise them as the same concept; they fail to match on exact-string comparison.

The signature of phase-two dimensions: hundreds to thousands of records, three to ten variants per real-world value, one or two teams primarily producing, an informal sense among practitioners that “the data is a bit messy” but no specific accounting of what is wrong.

Phase two is the easiest phase to identify and the easiest to resolve. Surface variance responds well to deterministic normalisation rules: lowercase, trim whitespace, expand known abbreviations, collapse known synonyms. A canonical reference for a phase-two dimension is straightforward to build because the underlying concepts are clear and the variance is mechanical.
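A minimal sketch of such rules, assuming the abbreviation and synonym tables are maintained by hand. The tables and the sample values below are invented for illustration and do not come from any real dimension.

```python
import re

# Hypothetical lookup tables; a real dimension would maintain its own.
ABBREVIATIONS = {"intl": "international", "dept": "department", "mktg": "marketing"}
SYNONYMS = {"international marketing department": "global marketing"}


def normalise(value: str) -> str:
    # Lowercase and trim, collapsing internal runs of whitespace.
    text = re.sub(r"\s+", " ", value.strip().lower())
    # Expand known abbreviations token by token, ignoring a trailing full stop.
    tokens = [ABBREVIATIONS.get(tok.rstrip("."), tok) for tok in text.split(" ")]
    text = " ".join(tokens)
    # Collapse known synonyms onto one canonical label.
    return SYNONYMS.get(text, text)


raw = [
    "Intl Mktg Dept ",
    "intl. mktg. dept.",
    "INTERNATIONAL  MARKETING DEPARTMENT",
    "Global Marketing",
]
canonical = {normalise(v) for v in raw}
```

All four raw variants collapse to a single canonical label. The rules are deterministic, so the same input always yields the same output, which is what makes this phase tractable.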

The risk in phase two is complacency. An organisation that resolves phase-two variance and declares the problem solved is treating the symptom and leaving the conditions in place. The next phase is already accumulating underneath.

Phase three: semantic and definitional variance.

A dimension in phase three has variance that cannot be resolved mechanically. Different teams have used different labels for the same concept. Different parts of the organisation have used the same label for different concepts. The variance is no longer about casing and whitespace; it is about meaning.
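Both kinds of conflict can at least be surfaced once each team's usage has been written down. The sketch below assumes that someone has already recorded what each team means by each label. That recording step is the human judgement, and the code does not replace it. The teams, labels and concepts are invented examples.

```python
from collections import defaultdict

# Hypothetical usage records: (team, label, concept the team means by it).
usages = [
    ("sales", "enterprise", "accounts over 1000 seats"),
    ("finance", "enterprise", "accounts over 1m annual revenue"),
    ("sales", "smb", "accounts under 100 seats"),
    ("support", "small business", "accounts under 100 seats"),
]

concepts_by_label = defaultdict(set)
labels_by_concept = defaultdict(set)
for _team, label, concept in usages:
    concepts_by_label[label].add(concept)
    labels_by_concept[concept].add(label)

# Same label, different concepts: someone must decide what the label means.
overloaded_labels = {l: c for l, c in concepts_by_label.items() if len(c) > 1}
# Same concept, different labels: someone must decide which label wins.
competing_labels = {c: l for c, l in labels_by_concept.items() if len(l) > 1}
```

The output is a list of decisions to be made, not a list of corrections to apply.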

The signature of phase-three dimensions: thousands to tens of thousands of records, dozens of variants per concept, multiple teams independently producing, an active dispute among teams about what the canonical form should be. Reports that look correct produce numbers that do not reconcile across functions. Cross-system analysis requires bespoke mapping for every query.

Phase three is the phase where most organisations live with their most important dimensions. The phase resists mechanical resolution because the resolution requires human judgement about what the canonical form should be, and the judgement requires authority that no individual has unilaterally. The canonical reference for a phase-three dimension is built through negotiation, not through transformation. It is also the phase where the cost of leaving the problem unaddressed becomes legible at the business level: wrong reports, blocked analyses, eroding trust in the data function.

The transition from phase three to a managed state requires the operating model the target state document describes. Named stewards with the authority to decide. An intake process that gives that authority procedural form. Propagation that reaches every system. Without these, attempts to clean phase-three variance produce three cleanups that disagree with each other.

Phase four: agent-driven configuration variance.

A dimension in phase four has agents as significant producers of its values. The agents do not typo and do not improvise abbreviations. They reproduce, faithfully and at scale, whatever convention they were configured with. The variance that emerges is the variance between configurations.

The signature of phase-four dimensions: high volume of records (millions, in some systems), structurally clean values that nevertheless do not reconcile across producers, no individual records that look wrong on inspection, growing gaps in cross-system analysis that cannot be attributed to any visible cleanup deficit.
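The pattern is easy to miss because every record passes inspection. As an illustrative sketch, assume two agents (the names and country-code conventions here are invented) that each emit well-formed values under a different configuration:

```python
import re

# Hypothetical output of two agents configured with different conventions.
emitted = {
    "agent_a": ["US", "DE", "FR", "US"],
    "agent_b": ["USA", "DEU", "FRA"],
}

# Each value passes a structural check on its own...
well_formed = re.compile(r"[A-Z]{2,3}")
all_clean = all(well_formed.fullmatch(v) for vals in emitted.values() for v in vals)

# ...yet the producers share no values, so a join across them matches nothing.
value_sets = {producer: set(vals) for producer, vals in emitted.items()}
shared = set.intersection(*value_sets.values())
unreconciled = {p: sorted(vals - shared) for p, vals in value_sets.items()}
```

Record-level validation reports no problems here. The gap only becomes visible when value sets are compared per producer.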

Phase four is the phase that the manifesto and the primer describe as the new frontier. It is also the phase that most existing tools are least equipped to address, because the standard data quality interventions (validation rules, deduplication algorithms, fuzzy matching) operate on the data as it appears in the warehouse, not on the configurations that produced it.

A canonical reference for a phase-four dimension is not enough on its own. The reference also needs to be operationally consulted by the agents, through the runtime mechanism the target state describes: synchronised local caches, staleness-aware confidence, consumer-side policy. Without the runtime layer, the reference exists in the warehouse and the agents continue to produce their respective configurations, and the variance compounds.
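The target state document is the authority on how the runtime layer works. What follows is only one possible reading of the three ingredients, offered as a sketch. The CachedReference class, the linear confidence decay, the one-hour maximum age and the threshold values are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CachedReference:
    """Hypothetical local copy of a canonical reference held by one agent."""
    mapping: dict[str, str]          # raw value -> canonical value
    synced_at: float                 # time of the last sync, in seconds
    max_age_seconds: float = 3600.0  # assumed: confidence reaches zero after an hour

    def lookup(self, raw: str, now: float) -> tuple[str | None, float]:
        canonical = self.mapping.get(raw)
        if canonical is None:
            return None, 0.0
        # Assumed decay: linear from 1.0 (just synced) to 0.0 (max age reached).
        age = max(0.0, now - self.synced_at)
        confidence = max(0.0, 1.0 - age / self.max_age_seconds)
        return canonical, confidence


def resolve(cache: CachedReference, raw: str, now: float,
            min_confidence: float) -> str | None:
    # Consumer-side policy: each consumer chooses its own confidence threshold.
    canonical, confidence = cache.lookup(raw, now)
    return canonical if confidence >= min_confidence else None


cache = CachedReference(mapping={"USA": "US"}, synced_at=1_000.0)
fresh = resolve(cache, "USA", now=1_360.0, min_confidence=0.8)  # cache is 360 s old
stale = resolve(cache, "USA", now=4_000.0, min_confidence=0.8)  # cache is 3000 s old
```

In this sketch the fresh lookup returns the canonical value and the stale one returns nothing. A consumer would then refuse the value or trigger a resync rather than silently fall back to the agent's own convention.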

The four phases in summary.

Phase	Variance type	Volume	Producers	Resolution
1. Clean by accident	None visible	Low	One team or one system	None needed yet; risk is unmonitored
2. Surface variance	Mechanical (casing, whitespace, abbreviations)	Medium	One or two teams	Deterministic normalisation
3. Semantic and definitional variance	Semantic (conflicting labels and meanings; human judgement required)	High	Multiple teams	Operating model: stewards, intake, propagation
4. Agent-driven configuration variance	Configurations conflicting at scale	Very high	Agents	Runtime layer: cache, confidence, consumer policy

How to identify which phase a dimension is in.

Three diagnostic questions, ordered from coarsest to most precise.

The first question: is the variance visible on inspection of the data? If a human eye scanning a sample of records can spot inconsistencies, the dimension is in phase two or phase three. If the data looks clean on inspection but cross-system reports do not reconcile, the dimension is in phase three or phase four. If the data looks clean and reports reconcile, the dimension is in phase one (or, rarely, is genuinely well-managed; this first question does not distinguish the two, but the third question does).

The second question: who is producing the values? If humans are the dominant producers, the dimension is in phase one, two, or three. If agents are a significant share of producers, the dimension is in phase four regardless of what the other diagnostics suggest, because phase four is defined by the producer mix rather than by the variance pattern.

The third question: does a canonical reference exist for this dimension that is actively consulted by consumers? If yes, the dimension is being managed and the phases do not apply in the same way. If no, the dimension is in one of the four phases described, and the appropriate intervention is determined by which one.
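The three questions can be combined into a small decision function. The text does not say in what order the answers should override one another. The sketch below assumes the third question is checked first and the second before the first, and the function name and parameters are illustrative.

```python
from __future__ import annotations


def candidate_phases(variance_visible: bool, reports_reconcile: bool,
                     agents_significant: bool, reference_consulted: bool) -> set[int]:
    """Return the phases consistent with the answers; an empty set means 'managed'."""
    # Question three: an actively consulted reference means the dimension is managed.
    if reference_consulted:
        return set()
    # Question two: a significant agent share means phase four regardless.
    if agents_significant:
        return {4}
    # Question one, narrowed to the phases where humans are the dominant producers.
    if variance_visible:
        return {2, 3}
    if not reports_reconcile:
        return {3}
    return {1}


messy_text_field = candidate_phases(True, False, False, False)
clean_but_unreconciled = candidate_phases(False, False, False, False)
agent_fed = candidate_phases(False, False, True, False)
```

Where the result still contains two phases, as with visible variance, the signatures described earlier (variants per value, number of producing teams, presence of an active dispute) separate them.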

Why the model matters.

The model exists because the most common error in dimensional data work is applying the wrong intervention to the wrong phase. Mechanical normalisation applied to phase-three variance produces three teams cleaning independently and disagreeing. An operating model applied to a phase-one dimension is over-engineering for a problem that has not yet appeared. A canonical reference without a runtime layer applied to phase-four variance produces a beautiful reference that nobody consults.

Naming the phase is the first step. The interventions that work in each phase are the subject of the target state; what this article offers is the diagnostic that determines which intervention applies.