Target state of dimension management

The target state of dimension management has four parts: a canonical reference, a runtime, a workflow, and a measurement model. This document specifies what well-managed dimensional data looks like in operational detail and traces the maturity path organisations follow on the way to it.

I. What this document specifies.

This is the specification of the discipline of dimension management. It describes what well-managed dimensional data looks like in operational detail: the canonical reference that holds the accepted values, the runtime through which producers and consumers interact with that reference, the workflow by which the reference stays current, the metrics by which the system is measured, and the maturity path organisations follow on the way to it.

The specification is written for the practitioner who is going to build or operate the system, and for the leader who is going to fund it. It is not an argument for the discipline; that is the job of the manifesto. It is also not a tutorial; that is the job of the Foundations module. It is the description of the target the discipline aims at, written densely enough to be cited in design reviews and tender documents.

Five core sections follow: the canonical reference, the runtime, the workflow, the metrics, and the maturity path. Two closing sections name what the target state is not and explain how to use this document.

II. The canonical reference.

The canonical reference is the authoritative source for a managed dimension. It is a system-agnostic surface, exposed via API, that any consumer can read from. The reference does not live inside a warehouse or a catalog or an MDM platform; those systems are themselves consumers of the reference.

For each managed dimension, the reference holds:

Accepted values. The set of canonical strings for the dimension. These are the unit of resolution: every alias in the data and every validation rule against an operational system resolves to one of them.
Definitions. Each accepted value carries a definition that is precise enough to disambiguate edge cases. “Mid-Market: a company with 100-999 employees, measured at the legal entity, point-in-time annual.” The definition is what stewards adjudicate against when a case is contested.
Aliases. Each accepted value carries a set of observed variants that resolve to it. Aliases live alongside the canonical rather than in a separate mapping table; this is the property that makes the reference operationally load-bearing.
Ownership. Each dimension names a steward and a steward team. The steward holds decision rights; the team holds shared context. Both are recorded structurally so that any consumer can route an escalation correctly.
Version history. Every change to the reference produces a new version. Earlier versions remain readable indefinitely; records written under version N must remain interpretable under every subsequent version.
Change log. Each version commit carries the change rationale, the upstream request that prompted it, and the steward who approved it. The log is the audit surface.
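The six elements above can be sketched as a small data model. This is a minimal illustration, not a prescribed schema: the class and field names (`Dimension`, `AcceptedValue`, `Change`, `commit`, `resolve`), the steward and request identifiers, and the case-insensitive matching rule are all assumptions made for the example. Retention of earlier versions as readable snapshots is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One change-log entry: the audit record for a version commit."""
    version: int
    rationale: str
    request_id: str
    approved_by: str


@dataclass
class AcceptedValue:
    canonical: str
    definition: str
    aliases: set[str] = field(default_factory=set)  # stored with the canonical


@dataclass
class Dimension:
    name: str
    steward: str
    steward_team: str
    version: int = 0
    values: dict[str, AcceptedValue] = field(default_factory=dict)
    change_log: list[Change] = field(default_factory=list)

    def commit(self, value: AcceptedValue, rationale: str,
               request_id: str, approved_by: str) -> int:
        """Add or replace an accepted value; every commit is a new version."""
        self.values[value.canonical] = value
        self.version += 1
        self.change_log.append(
            Change(self.version, rationale, request_id, approved_by))
        return self.version

    def resolve(self, observed: str) -> Optional[str]:
        """Return the canonical for an observed string, or None if unknown."""
        needle = observed.strip().casefold()  # assumed normalisation rule
        for value in self.values.values():
            names = {value.canonical, *value.aliases}
            if needle in {n.casefold() for n in names}:
                return value.canonical
        return None


# Hypothetical dimension, steward, and request identifiers.
segment = Dimension("customer_segment", steward="a.steward", steward_team="revops")
segment.commit(
    AcceptedValue(
        "Mid-Market",
        "A company with 100-999 employees, measured at the legal entity, "
        "point-in-time annual.",
        aliases={"mid market", "MM"},
    ),
    rationale="Initial load", request_id="REQ-1", approved_by="a.steward",
)
```

Keeping the aliases on the `AcceptedValue` itself, rather than in a separate mapping table, means a single read of the dimension is enough to resolve any observed variant.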

Federation is the architectural pattern by which multiple canonical references coexist within an organisation, each scoped to a domain that has the context to maintain it. Vendor names sit with procurement. Product categories sit with product. Customer segments sit with sales or revenue operations. A directory records which dimension is owned by which canonical reference; the directory carries no values of its own, only the ownership index. The directory is the mechanism by which federation does not become a free-for-all.
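The directory described above can be pictured as an ownership index that refuses conflicting registrations. The sketch below is illustrative only; the `Directory` class, its method names, and the reference identifiers are hypothetical.

```python
class Directory:
    """Ownership index: dimension name -> owning canonical reference.

    Holds no dimensional values, only which reference owns which dimension.
    """

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}

    def register(self, dimension: str, reference: str) -> None:
        """Record ownership; reject a second, different owner."""
        current = self._owners.get(dimension)
        if current is not None and current != reference:
            raise ValueError(
                f"{dimension!r} is already owned by {current!r}")
        self._owners[dimension] = reference

    def owner_of(self, dimension: str) -> str:
        """Return the owning reference; raises KeyError if unmanaged."""
        return self._owners[dimension]


directory = Directory()
directory.register("vendor_name", "procurement-reference")
directory.register("product_category", "product-reference")
directory.register("customer_segment", "revops-reference")
```

Rejecting a second owner at registration time is one way to keep federation from becoming a free-for-all; an organisation could equally route the conflict to a human for adjudication.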

III. The runtime.

The runtime is the path through which dimensional values flow at production speed. The reference is read; values are validated, classified, or annotated; output is emitted with a confidence signal that reflects how reliable the classification was. The runtime is what makes the reference operationally load-bearing rather than advisory.

Caches. Producers and consumers maintain local caches of the canonical reference, synchronised continuously to the authoritative source. Caches make classification happen at memory speed without a round trip to a remote system. Cache freshness is the dominant input to the confidence signal.
Producers. Agents and pipelines that produce dimensional values classify against the cache. Where a value matches a canonical or a known alias, the producer emits a high-confidence resolved record. Where the value matches nothing, the producer emits an unresolved record and opens an intake request.
Confidence. Every produced value carries a confidence signal. The signal is a function of cache freshness, match quality, and the producer's own uncertainty. Confidence is structured rather than impressionistic; downstream systems can act on it programmatically.
Consumer policy. Consumers decide what to do with non-full-confidence values according to a configurable policy. A regulatory consumer halts on anything below full confidence. An analytical consumer tolerates hours of staleness. A monitoring consumer treats low confidence as a measurable signal of upstream drift. The halt decision thereby moves from the producer (which lacks context) to the consumer (which has it).
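A producer classifying against its local cache might look like the following sketch. The text says only that confidence is a function of cache freshness, match quality, and the producer's own uncertainty; the multiplicative formula, the linear freshness decay, the one-hour `max_age_s` default, and all names here are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolved:
    observed: str
    canonical: Optional[str]  # None means unresolved
    confidence: float         # 0.0 .. 1.0


def freshness(cache_age_s: float, max_age_s: float) -> float:
    """1.0 for a just-synchronised cache, falling linearly to 0.0 at max_age_s."""
    return max(0.0, 1.0 - cache_age_s / max_age_s)


def classify(observed: str, canonicals: set[str], aliases: dict[str, str],
             cache_age_s: float, max_age_s: float = 3600.0,
             producer_certainty: float = 1.0) -> Resolved:
    """Classify one value against the local cache, with no remote round trip."""
    if observed in canonicals:
        canonical = observed
    elif observed in aliases:
        canonical = aliases[observed]
    else:
        # No match: emit an unresolved record. The caller would open an
        # intake request at this point.
        return Resolved(observed, None, 0.0)
    # Exact canonical and known-alias hits both count as full match quality
    # here; a fuzzy matcher would supply a lower value.
    match_quality = 1.0
    confidence = (freshness(cache_age_s, max_age_s)
                  * match_quality * producer_certainty)
    return Resolved(observed, canonical, confidence)


canonicals = {"Mid-Market", "Enterprise"}
aliases = {"MM": "Mid-Market", "Ent.": "Enterprise"}
fresh = classify("Mid-Market", canonicals, aliases, cache_age_s=0.0)
stale = classify("MM", canonicals, aliases, cache_age_s=1800.0)
unknown = classify("Midmarket??", canonicals, aliases, cache_age_s=0.0)
```

Because freshness multiplies every other factor in this sketch, a stale cache lowers the confidence of even an exact match, which is one way to make cache freshness the dominant input.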

The architecture is designed to scale to machine-speed production. Where dimensional values were historically produced by humans and bounded by typing speed, agents now produce values orders of magnitude faster, and the runtime is the surface that lets the system keep up. Producers do not negotiate validation with the reference at every record; they read from a synchronised cache and emit confidence-weighted output. The reference handles its own propagation across the cache topology in the background.
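Because the confidence signal is structured, the consumer policies described above can be expressed as data rather than as code paths inside the producer. The sketch below is one possible shape; the `Policy` and `Action` names and the numeric thresholds for the analytical and monitoring consumers are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    FLAG = "flag"   # accept the record but count it as a drift signal
    HALT = "halt"   # stop processing and escalate


@dataclass(frozen=True)
class Policy:
    min_confidence: float  # at or above this, the record is accepted
    below: Action          # what to do otherwise


# A regulatory consumer halts on anything below full confidence.
REGULATORY = Policy(min_confidence=1.0, below=Action.HALT)
# Assumed thresholds: the text gives no numbers for these two consumers.
ANALYTICAL = Policy(min_confidence=0.5, below=Action.FLAG)
MONITORING = Policy(min_confidence=0.9, below=Action.FLAG)


def decide(policy: Policy, confidence: float) -> Action:
    """Apply a consumer's policy to one record's confidence value."""
    if confidence >= policy.min_confidence:
        return Action.ACCEPT
    return policy.below
```

The same record can therefore be accepted by one consumer and halted by another, without the producer knowing which consumers exist.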

IV. The workflow.

The workflow is the operational surface for the steward. It handles requests against the canonical reference, routes them to the right humans, captures decisions, and propagates the resulting changes. The workflow is what lets stewardship scale down to a part-time addition to an existing role rather than constituting a new full-time function.

Intake. Every modification to the reference originates in a request. Requests are produced automatically (the runtime observes a value below the auto-mapping threshold) or manually (anyone in the organisation opens one). The request is the primitive through which all changes flow.
Auto-resolution. Where the system is confident (a known alias for an existing canonical, or a value matching an obvious normalisation rule), the request resolves automatically and leaves a record in the log. The steward reviews resolutions in aggregate rather than individually.
Steward queue. Where the system is not confident, the request enters the steward's queue. The queue is prioritised by business impact: requests on dimensions feeding consequential consumers surface first; one-off values with no recurrence surface last. The steward's attention is the bottleneck the queue protects.
Resolution. Each request resolves to one of three states: accepted (a new canonical), rejected (the value should not be in the reference), or merged into an existing canonical as an alias. The resolution updates the reference, advances the version, and propagates to consumers via the cache topology.
Deprecation. Existing canonicals are retired through a controlled lifecycle: marked deprecated, given a sunset window, removed from new-record validation while remaining readable for historical records. Deprecation is a first-class operation, not an ad-hoc deletion.
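The three resolution outcomes and their effect on the reference can be sketched as follows. The `Reference` class and its methods are hypothetical. One assumption is worth flagging: the text does not say whether a rejection advances the version, so this sketch logs a rejection without advancing it, since a rejection changes no accepted values.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ACCEPTED = "accepted"  # the value becomes a new canonical
    REJECTED = "rejected"  # the value should not be in the reference
    MERGED = "merged"      # the value becomes an alias of a canonical


class Reference:
    def __init__(self) -> None:
        self.version = 0
        self.aliases: dict[str, set[str]] = {}  # canonical -> its aliases
        self.log: list[tuple[int, str, Outcome, str]] = []

    def resolve_request(self, value: str, outcome: Outcome, steward: str,
                        merge_into: Optional[str] = None) -> int:
        """Apply one resolved request and return the resulting version."""
        if outcome is Outcome.ACCEPTED:
            if value in self.aliases:
                raise ValueError(f"{value!r} is already a canonical")
            self.aliases[value] = set()
        elif outcome is Outcome.MERGED:
            if merge_into not in self.aliases:
                raise ValueError(f"unknown canonical {merge_into!r}")
            self.aliases[merge_into].add(value)
        # Assumption: only changes to accepted values advance the version.
        if outcome is not Outcome.REJECTED:
            self.version += 1
        self.log.append((self.version, value, outcome, steward))
        return self.version


ref = Reference()
ref.resolve_request("Mid-Market", Outcome.ACCEPTED, steward="a.steward")
ref.resolve_request("MM", Outcome.MERGED, steward="a.steward",
                    merge_into="Mid-Market")
ref.resolve_request("asdf", Outcome.REJECTED, steward="a.steward")
```

Propagation to consumer caches and the deprecation lifecycle are left out of the sketch; both would hang off the same version counter.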

The workflow is configurable per dimension. A fast-changing dimension like customer segment may auto-resolve aggressively; a slow-changing high-stakes dimension like account classification may require human review on every change. The point of configurability is that the discipline of dimension management is the same across dimensions but the rate and tolerance differ.
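Per-dimension configuration of this kind could be as small as a threshold and a review flag. The following is a sketch under assumed names (`WorkflowConfig`, `route`) and assumed threshold values; the dimension keys mirror the two examples in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowConfig:
    # Match score at or above which a request resolves automatically.
    auto_resolve_threshold: float
    # If set, every change goes to a human regardless of score.
    require_human_review: bool = False


# Hypothetical settings for the two example dimensions.
CONFIGS = {
    "customer_segment": WorkflowConfig(auto_resolve_threshold=0.8),
    "account_classification": WorkflowConfig(
        auto_resolve_threshold=1.0, require_human_review=True),
}


def route(dimension: str, match_score: float) -> str:
    """Send a request to auto-resolution or to the steward queue."""
    config = CONFIGS[dimension]
    if config.require_human_review:
        return "steward_queue"
    if match_score >= config.auto_resolve_threshold:
        return "auto_resolve"
    return "steward_queue"
```

The routing logic is identical for both dimensions; only the configuration differs, which is the sense in which the discipline stays the same while the rate and tolerance vary.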

V. The metrics.

Five metrics measure whether the system is working. Each is computable from the runtime and the workflow; none requires external instrumentation. Together they form the health dashboard the steward operates against and the signal the funding leader uses to justify continued investment.

Coverage. The fraction of operational records flowing through the system that arrive at consumers carrying a canonical value. High coverage means the reference is reachable from every producer. Low coverage points to gaps in connector wiring or to dimensions that exist in the operational systems but not in the reference.
Drift. The rate at which new variants appear in a dimension over time. Rising drift suggests an upstream change: a new agent deployed without coordination, a new geography onboarded with local conventions, an acquisition that introduced parallel structures. Drift is the early-warning signal for the canonical reference needing intervention.
Resolution latency. The time from a request being opened to its resolution. Short latency means the steward queue is keeping pace with the inflow. Long latency means requests are accumulating and the cache freshness signal is degrading. Resolution latency is the steward's primary operational metric.
Resolution coverage. The fraction of records arriving at consumers with full confidence rather than degraded confidence. It differs from coverage in that a record can be resolved (a canonical attached) but resolved with a warning. Low resolution coverage is what consumers feel directly; it is the metric a model team will see degrade when the reference falls behind.
Confidence depth. The distribution of confidence values across resolved records, not just the mean. A bimodal distribution (a spike at full confidence and a spike at low confidence) tells a different story than a centred distribution. Confidence depth is the metric stewards use to understand the character of the variance their reference is not yet absorbing.
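All five metrics can be computed from records the runtime and workflow already hold. The functions below are one possible reading of the definitions above, not the definitive formulas: the record shape `(canonical, confidence)`, the use of the median for latency, the five-bin histogram, and the treatment of drift as a per-period count of previously unseen variants are all assumptions.

```python
from statistics import median
from typing import Optional

Record = tuple[Optional[str], float]  # (canonical or None, confidence)


def coverage(records: list[Record]) -> float:
    """Fraction of records that carry a canonical value."""
    return sum(1 for c, _ in records if c is not None) / len(records)


def resolution_coverage(records: list[Record]) -> float:
    """Fraction of records that carry a canonical at full confidence."""
    full = sum(1 for c, conf in records if c is not None and conf == 1.0)
    return full / len(records)


def confidence_depth(records: list[Record], bins: int = 5) -> list[int]:
    """Histogram of confidence over resolved records (equal-width bins)."""
    counts = [0] * bins
    for canonical, conf in records:
        if canonical is not None:
            counts[min(int(conf * bins), bins - 1)] += 1
    return counts


def resolution_latency(requests: list[tuple[float, float]]) -> float:
    """Median seconds from request opened to request resolved."""
    return median(closed - opened for opened, closed in requests)


def drift(variants_per_period: list[set[str]]) -> list[int]:
    """Count of never-before-seen variants in each successive period."""
    seen: set[str] = set()
    counts = []
    for variants in variants_per_period:
        counts.append(len(variants - seen))
        seen |= variants
    return counts


sample = [("A", 1.0), ("A", 1.0), ("B", 0.3), (None, 0.0)]
```

On the four-record `sample`, three records carry a canonical and two of those are at full confidence, so coverage and resolution coverage diverge in the way the definitions above describe.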

These metrics are not vanity numbers. Each is acted on. A rising drift number triggers a steward review of the upstream producer. A latency excursion triggers a queue rebalance. A confidence-depth shift triggers a definition review. The metrics are part of the operating model, not a dashboard appended to it.

VI. The maturity path.

Organisations do not arrive at the target state in a single project. They progress through recognisable stages. The stages are not branded as a maturity model in the usual sense; they are the description of what the work actually looks like along the way.

Stage one: name the dimensions that matter. Before any reference exists, the organisation identifies which dimensions are load-bearing for which consumers. Most dimensions in most systems do not warrant management; the small subset that does will carry most of the value. Stage one is the inventory that distinguishes the two.
Stage two: build the first reference. A single canonical reference for a single managed dimension, with a named steward and a workflow. The goal at this stage is not coverage; it is to prove that the operating model works in the organisation's context. The first reference is usually a high-stakes but low-volume dimension where the steward already exists informally.
Stage three: wire the runtime. Producers and consumers begin reading the reference at runtime. Caches are deployed. The confidence signal starts flowing downstream. Consumers configure their policies. Coverage and drift become measurable. This is the stage at which the discipline starts producing benefits that are visible outside the team that built the reference.
Stage four: scale by federation. Additional references are added, each owned by the domain that has the context for it. The directory prevents conflicts. The stewardship pool grows; no single steward owns more than they can attend to. This is the stage at which the organisation moves from managing some dimensions well to managing the categorical surface as a whole.
Stage five: operationally load-bearing. The references are read by every system that produces or consumes the relevant dimensions. Drift is monitored as a system signal. Pipelines validate. Agents emit confidence-weighted output. The categorical surface is governed in production rather than in retrospect.

The path is the same across organisations; the time spent at each stage varies. Most organisations have not started stage one. The few that have completed stage five operate a categorical surface that does not appear in the retros because it has stopped producing failures the retros need to capture.

VII. What this target state is not.

Three clarifications, each developed at length in the library and the foundational documents.

The target state is not Master Data Management. MDM governs the entity: it decides which record is the authoritative customer, product, or supplier, and it resolves duplicates. The target state described here governs the categorical attributes on the entity record: which industry classification the customer carries, which category the product belongs to. The two disciplines are complementary, and most organisations will need both.

The target state is not Data Quality monitoring. DQ tools detect structural problems in data (nulls, type mismatches, range violations) and surface them to the team that maintains the pipeline. The target state described here governs the values themselves, including semantic equivalence problems that no string-comparison rule can resolve. DQ catches the symptoms; dimension management governs the underlying surface.

The target state is not a data catalog. Catalogs organise metadata about data assets and help practitioners find what they need. The target state described here is the source of truth that the catalog points at when it describes the valid values for a dimensional field. A well-maintained catalog without a canonical reference tells you what exists; it does not govern it.

The longer arguments for each of these distinctions are in Why Dimensional Data Outlives Every Tool. The point here is to be specific about what is being specified and what is being left to adjacent disciplines.

VIII. How to use this document.

The target state is the reference document for the discipline. It is meant to be cited in design reviews, referred to in tender responses, and used as the specification that an implementation can be evaluated against. The companion documents extend it in different directions.

The manifesto makes the case for the discipline. The primer establishes the underlying class of data. The four-layer taxonomy catalogues the specific failure modes the target state is a response to. Trace the Chain walks the taxonomy interactively, which is useful for organisations diagnosing where they are in the maturity path.

The product CleanDims is building is the implementation of the target state described here. The discipline, however, is not the product. Organisations can implement this specification with the product, with another vendor, with internal tooling, or with some combination.

What good looks like, specified.
