
Glossary.

Terms used across the CleanDims foundational documents and learning materials. Anchorable. Cross-referenced.

Terms: 33. Sections: 13. Anchor pattern: /learn/glossary#term-slug

This glossary defines the terms used throughout the CleanDims site. Each entry is independently anchorable; the URL pattern is /learn/glossary#term-slug. Cross-references between entries appear inline. Cross-references to the foundational documents and to the problem catalogue appear where useful.

The glossary is alphabetised. Terms are grouped where related entries share a root concept. The reader looking for a specific term should use the page's search; the reader trying to orient should read sequentially.

A

2 terms
Agent
In the CleanDims context, software acting on behalf of a user to perform a task that previously required human action. A finance agent classifying expense reports, a sales agent categorising leads, a support agent classifying tickets. Agents are increasingly significant producers of dimensional values, which is the basis for the agent-era variance argument the manifesto develops.
Alias
An observed form of a dimensional value that maps to a canonical form within the canonical reference. When a source system records “AWS” and the canonical form is “Amazon Web Services, Inc.,” the string “AWS” is an alias for the canonical. Aliases are stored within the reference itself, alongside the canonical entry they resolve to, rather than as a separately maintained mapping. See Layer 1 of the problem catalogue for the forms variance takes that aliases resolve.
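As an illustration of aliases stored alongside the canonical entry they resolve to, here is a minimal Python sketch. The `CanonicalEntry` class, the `resolve` function, and the sample vendor data are hypothetical names chosen for this example; they are not part of any CleanDims API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CanonicalEntry:
    """One canonical value together with the aliases that resolve to it."""
    canonical: str
    aliases: set[str] = field(default_factory=set)


def resolve(entries: list[CanonicalEntry], observed: str) -> Optional[str]:
    """Return the canonical form for an observed value, or None if it is unknown."""
    for entry in entries:
        if observed == entry.canonical or observed in entry.aliases:
            return entry.canonical
    return None


# Hypothetical vendor dimension, reusing the example from the entry above.
vendor_entries = [
    CanonicalEntry("Amazon Web Services, Inc.", {"AWS", "aws", "Amazon AWS"}),
]

print(resolve(vendor_entries, "AWS"))    # Amazon Web Services, Inc.
print(resolve(vendor_entries, "Azure"))  # None
```

Because the aliases live on the entry itself, there is no second mapping table that could fall out of step with the canonical values.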

C

8 terms
Cache
The local copy of the canonical reference that an agent or pipeline maintains for in-process lookup. Caches synchronise continuously to the reference. They allow classifications to happen at memory speed without a round trip to a remote system. Cache freshness is reflected in the confidence signal that accompanies the cache's output. The detailed architecture is in the target state under propagation and enforcement.
Canonical
The agreed, authoritative form of a value within a dimension. When five source systems record a single vendor as “AWS,” “aws,” “Amazon Web Services,” “Amazon Web Services, Inc.,” and “Amazon AWS,” the canonical form is whichever of these (or some sixth option) has been chosen by the steward and recorded in the canonical reference as authoritative. The canonical form is the one all consumers join against; the variants are stored as aliases.
Canonical reference
The authoritative source for a dimension's accepted values, definitions, ownership, version history, and aliases. The reference is system-agnostic: it does not live inside a warehouse, catalog, or master data system. It is its own surface, exposed via API to any consumer that needs to read it. The full structure is specified in the target state.
Categorical data
Data values that classify or describe rather than measure. See dimensional data; the two terms refer to the same class of values.
Confidence
The signal that accompanies an agent's classification output, reflecting how reliable the classification is in context. Confidence is a function of cache freshness (an agent classifying against a current cache produces full-confidence output; an agent classifying against a stale cache produces output with a confidence penalty) and of how well the input value matched a known canonical or alias. Confidence is the signal that consumer policy decisions act on.
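One way to picture confidence as a function of cache freshness and match quality is the sketch below. The formula, the five-minute freshness window, and the per-hour penalty are assumptions made purely for illustration; the source does not specify how the two inputs are combined.

```python
def confidence(match_score: float, cache_age_s: float,
               fresh_window_s: float = 300.0,
               penalty_per_hour: float = 0.1) -> float:
    """Combine match quality with a staleness penalty, clamped to [0, 1].

    match_score      -- how well the input matched a canonical or alias (0..1)
    cache_age_s      -- seconds since the cache last synchronised
    fresh_window_s   -- assumed age below which the cache counts as current
    penalty_per_hour -- assumed confidence lost per hour of staleness
    """
    stale_s = max(0.0, cache_age_s - fresh_window_s)
    penalty = penalty_per_hour * (stale_s / 3600.0)
    return max(0.0, min(1.0, match_score - penalty))


# An exact match against a current cache keeps full confidence.
print(confidence(match_score=1.0, cache_age_s=60.0))  # 1.0
```

A stale cache lowers the score rather than blocking the classification, which leaves the decision about what to do with the output to the consumer.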
Configuration variance
The form of dimensional variance that emerges when agents are the dominant producers of values. Where human-entered data accumulated surface variance (typos, casing) and semantic variance (synonyms), agent-entered data accumulates configuration variance: two agents configured with two different conventions reproduce those conventions faithfully and at scale, producing data that is internally clean and structurally incompatible across producers. See the manifesto for the argument that configuration variance is the dominant new failure mode.
Consumer
In the CleanDims architecture, any system that reads dimensional values produced elsewhere. A BI tool consuming dimensional values from a warehouse. A model training pipeline consuming dimensional features. An external report consuming the warehouse output. Consumers receive values from producers (humans or agents) through the canonical reference and apply their own consumer policy to handle output with low confidence.
Consumer policy
The configurable behaviour that a consumer applies to dimensional values arriving with less than full confidence. A regulatory consumer of a vendor classification might set zero tolerance and halt on any confidence penalty. An analytical consumer of the same classification might tolerate hours of cache staleness. The target state describes this as the locus shift: the halt decision moves from the agent (which lacks context) to the consumer (which has the context).
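The contrast between a regulatory and an analytical consumer can be sketched as two policies applied to the same confidence value. `ConsumerPolicy`, its `min_confidence` field, and the threshold values are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerPolicy:
    """A consumer's rule for handling values that arrive with a confidence signal."""
    name: str
    min_confidence: float  # values below this threshold are not consumed

    def decide(self, confidence: float) -> str:
        """Return 'accept' or 'halt' for a value with the given confidence."""
        return "accept" if confidence >= self.min_confidence else "halt"


# Assumed thresholds: the regulatory consumer tolerates no penalty at all.
regulatory = ConsumerPolicy("regulatory-report", min_confidence=1.0)
analytical = ConsumerPolicy("analytics-dashboard", min_confidence=0.7)

print(regulatory.decide(0.9))  # halt
print(analytical.decide(0.9))  # accept
```

The same classification output yields two different decisions, each made by the party that knows what is at stake.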

D

5 terms
Definitional variance
The form of variance in which the same label is used for different things in different parts of the organisation. The sales team's “Mid-Market” means companies with between 100 and 999 employees. The finance team's “Mid-Market” means companies with between $1M and $10M in annual revenue. The label is identical; the underlying populations are incompatible. It is the most dangerous form of variance because it produces no visible inconsistency in the data.
Dimension
A categorical attribute used to classify, group, or filter records. Country is a dimension. Vendor is a dimension. Customer segment, support ticket type, product category, deal stage: all dimensions. The convention comes from data warehousing, where dimensions are the axes along which facts are sliced.
Dimensional data
The class of data values that are categorical rather than quantitative. Distinguishable from numeric data (which supports arithmetic), identifier data (which uniquely refers to entities), temporal data (which locates events in time), and free-text data (which contains unstructured prose). Dimensional data is the focus of the CleanDims discipline; the full categorisation is in What Kind of Data Is Dimensional.
Directory
In the CleanDims architecture, the organisation-wide index that records which canonical reference owns which dimension. The directory carries no values and no definitions; it is purely an index from dimension name to authoritative reference. The directory prevents the duplicate-governance failure mode in which two federated nodes both claim authority over the same dimension.
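A directory that holds only ownership, and refuses a second claim on the same dimension, might look like the following sketch. The `Directory` class and the reference names are hypothetical.

```python
class Directory:
    """Index from dimension name to the reference that owns it; holds no values."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}

    def register(self, dimension: str, reference: str) -> None:
        """Record ownership, rejecting a conflicting claim on the same dimension."""
        owner = self._owners.get(dimension)
        if owner is not None and owner != reference:
            raise ValueError(f"{dimension!r} is already owned by {owner!r}")
        self._owners[dimension] = reference

    def owner_of(self, dimension: str) -> str:
        """Return the authoritative reference for a dimension."""
        return self._owners[dimension]


directory = Directory()
directory.register("vendor", "procurement-reference")
print(directory.owner_of("vendor"))  # procurement-reference
```

A later call such as `directory.register("vendor", "finance-reference")` raises `ValueError`, which is the duplicate-governance case the entry describes.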
Drift
The rate at which new variants appear in a dimension over time. Rising drift suggests an upstream change (a new agent deployed without coordination, a new geography onboarded with local conventions, an organisational shift the reference has not absorbed). Drift is one of the five metrics the target state describes.
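A simple way to approximate drift is to count the values seen for the first time in each period. The sketch below assumes observations arrive as `(period, value)` pairs in time order; the function name and the input shape are illustrative, not a prescribed metric definition.

```python
from collections import Counter


def new_variants_per_period(observations):
    """Count values seen for the first time in each period.

    `observations` is an iterable of (period, value) pairs in time order.
    Periods in which no new value appears are absent from the result.
    """
    seen = set()
    counts = Counter()
    for period, value in observations:
        if value not in seen:
            seen.add(value)
            counts[period] += 1
    return dict(counts)


sample = [
    ("2024-01", "AWS"), ("2024-01", "aws"),
    ("2024-02", "AWS"), ("2024-02", "Amazon AWS"),
]
print(new_variants_per_period(sample))  # {'2024-01': 2, '2024-02': 1}
```

A count that rises from one period to the next is the signal to look for an upstream change.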

E

1 term
Entity resolution
The process of determining whether two records refer to the same real-world entity. Entity resolution is the primary concern of Master Data Management. It is distinct from dimension management, which governs the categorical attributes on the records that have been resolved. See Why Dimensional Data Outlives Every Tool for the longer treatment.

F

1 term
Federation
The architectural pattern in which multiple canonical references coexist within an organisation, each scoped to a domain that has the context to maintain it. Vendor names sit with procurement; product categories sit with product; customer segments sit with sales or revenue operations. The directory prevents conflicts by ensuring that each dimension is owned by exactly one federated node.

G

1 term
Granularity variance
The form of variance in which the same concept is recorded at different levels of a hierarchy across systems. The CRM records industry as “Financial Services.” The data warehouse records it as “Retail Banking.” Each value is correct at its own level; the values cannot be aggregated together without first agreeing on a canonical depth.

H

1 term
Hierarchical dimension
A dimension whose values have parent-child relationships. Geography (city > state > country > region). Product (SKU > line > category > family). Organisational structure (team > department > function). Hierarchical dimensions require the canonical reference to capture parent-child relationships explicitly and to validate that records do not contain impossible combinations across levels.
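Validating a record against explicit parent-child relationships can be sketched as follows. The `PARENT` table, the level names, and the sample records are assumptions for illustration only.

```python
# Hypothetical parent links for a geography dimension: child -> parent.
PARENT = {
    "Austin": "Texas",
    "Texas": "United States",
    "Toronto": "Ontario",
    "Ontario": "Canada",
}


def is_consistent(record: dict, levels: tuple = ("city", "state", "country")) -> bool:
    """True if each level's value has the next level's value as its recorded parent."""
    for child_level, parent_level in zip(levels, levels[1:]):
        if PARENT.get(record[child_level]) != record[parent_level]:
            return False
    return True


print(is_consistent({"city": "Austin", "state": "Texas", "country": "United States"}))  # True
print(is_consistent({"city": "Austin", "state": "Ontario", "country": "Canada"}))       # False
```

The second record is the kind of impossible combination across levels that the reference is expected to reject.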

I

2 terms
Identifier
A value that uniquely refers to a specific entity. Customer ID, order number, employee ID, transaction reference. Identifiers do not classify; they refer. They are distinct from dimensional data in their primary use, though some identifiers (notably customer ID) are used as grouping fields in reports and behave dimensionally in that context.
Intake
The workflow by which new values are proposed, reviewed, and accepted (or rejected) into the canonical reference. Intake originates from two sources: automatic detection (the system observes a value below the auto-mapping confidence threshold) and manual proposal (anyone in the organisation opens a request). Every modification to the reference originates in a request and leaves a record in the change log.

M

2 terms
Master Data Management (MDM)
The discipline of managing master records of core business entities (customers, products, suppliers, employees) and synchronising those records across systems. MDM and dimension management are complementary: MDM governs the entity, dimension management governs the categorical attributes on the entity. See Why Dimensional Data Outlives Every Tool.
Measure
A quantitative value in a dataset. Revenue, headcount, latency, page views. Distinguishable from dimensional data by whether arithmetic is meaningful. The complement of dimensions in data warehouse design.

R

2 terms
Request
The durable artifact capturing a proposed change to the canonical reference: who proposed it, the reasoning, the discussion among reviewers, and the resolution. Requests are the primitive through which all changes flow. They originate from automatic detection or manual proposal and resolve to one of three states: accepted, rejected, or merged into an existing canonical value as an alias.
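The fields named in this entry suggest a record shaped roughly like the sketch below. The class and field names are hypothetical; only the three resolution states come from the entry itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Resolution(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    MERGED_AS_ALIAS = "merged_as_alias"


@dataclass
class Request:
    """A proposed change to the canonical reference and its eventual outcome."""
    dimension: str
    proposed_value: str
    proposed_by: str  # a person, or a marker for automatic detection
    reasoning: str
    discussion: list[str] = field(default_factory=list)
    resolution: Optional[Resolution] = None  # None while the request is open

    def resolve(self, outcome: Resolution) -> None:
        """Close the request once; a resolved request cannot be resolved again."""
        if self.resolution is not None:
            raise ValueError("request is already resolved")
        self.resolution = outcome
```

Keeping the discussion and the outcome on the same object is what makes the request a durable record rather than a transient ticket.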
Resolution coverage
The percentage of records flowing through the system that arrive at consumers with a canonical value attached, regardless of confidence level. A record resolved with a warning is still resolved; a record arriving with no canonical at all is the unhealthy state. Low resolution coverage points to gaps in connector coverage, missing canonical references, or pipelines not wired to the reference. One of the five metrics the target state describes.
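The metric reduces to a simple ratio. In this sketch each record is assumed to be a dictionary with an optional `canonical` key; that shape is an assumption for the example.

```python
def resolution_coverage(records) -> float:
    """Percentage of records that carry a canonical value, whatever their confidence."""
    records = list(records)
    if not records:
        return 0.0
    resolved = sum(1 for r in records if r.get("canonical") is not None)
    return 100.0 * resolved / len(records)


sample = [
    {"raw": "AWS", "canonical": "Amazon Web Services, Inc.", "confidence": 1.0},
    {"raw": "aws", "canonical": "Amazon Web Services, Inc.", "confidence": 0.6},
    {"raw": "Amazn", "canonical": None},
    {"raw": "Amazon AWS", "canonical": "Amazon Web Services, Inc.", "confidence": 0.9},
]
print(resolution_coverage(sample))  # 75.0
```

The low-confidence record still counts as resolved; only the record with no canonical at all lowers the figure.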

S

4 terms
Semantic variance
The form of variance in which different strings refer to the same concept and the equivalence cannot be detected from the strings alone. “Laptop” and “Notebook.” “Freelancer” and “Independent Contractor.” Resolution requires picking which label is canonical and mapping the others to it; the decision requires human judgement.
Steward
The named individual with decision rights for a dimension. The steward holds authority over additions, definitional changes, deprecations, and contested decisions within their dimension. Stewardship is typically part-time and layered onto an existing role rather than constituted as a new full-time function. The target state describes the stewardship model in detail.
Structural contamination
The form of variance in which a dimensional field contains data that does not belong in a single field: composite codes encoding multiple dimensions, multi-value strings, hierarchical paths stored flat, metadata baked into values. Structural contamination must be unpicked before any other resolution can begin; it is the prerequisite layer.
Surface variance
The form of variance in which the same concept is recorded with different formatting: casing, punctuation, spacing, abbreviations. The easiest form to detect and the easiest to resolve. As agent production grows, surface variance becomes a smaller share of the problem; configuration variance becomes a larger share.
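Casing, punctuation, and spacing differences can usually be folded away mechanically, as in the sketch below. The function name is hypothetical, and the rules shown are one reasonable choice rather than a prescribed normalisation.

```python
import re


def normalise_surface(value: str) -> str:
    """Fold casing, punctuation, and spacing so surface variants compare equal."""
    value = value.casefold()
    value = re.sub(r"[^\w\s]", " ", value)     # replace punctuation with spaces
    return re.sub(r"\s+", " ", value).strip()  # collapse runs of whitespace


print(normalise_surface("Amazon Web Services, Inc."))    # amazon web services inc
print(normalise_surface("  AMAZON  web services inc "))  # amazon web services inc
```

Abbreviations such as “AWS” are not recoverable this way; they still need an alias recorded in the canonical reference.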

T

1 term
Temporal variance
The form of variance in which the canonical form changes over time and historical records retain the older form. A product renamed mid-year. A department restructured. A vendor rebranded. The new label is applied to new records; old records keep the old label; queries spanning the boundary fail to reconcile.

V

3 terms
Variance
The general term for dimensional inconsistency. Variance takes surface, semantic, definitional, granularity, and temporal forms (plus three additional forms enumerated in Layer 1 of the problem catalogue). The shape of the variance determines what intervention will resolve it.
Variant
An observed form of a dimensional value, before resolution to a canonical. Variants are recorded as aliases in the canonical reference once they have been mapped. The set of variants for a given canonical is typically much larger than the set of canonicals themselves; the work of dimension management is largely the work of mapping variants to canonicals.
Version
A specific state of the canonical reference at a point in time. Consumers pin to versions so that historical analysis remains interpretable. Records written under version N can still be resolved under version N+1, with the mapping captured explicitly in the reference's change log.
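Resolving a value written under an older version against a newer one can be sketched as a walk through the change log. The `CHANGE_LOG` structure, the version numbers, and the vendor names are hypothetical.

```python
# Hypothetical change log: for each version, the renames it introduced.
CHANGE_LOG = {
    2: {"Acme Corp": "Acme Corporation"},
    3: {"Acme Corporation": "Acme Holdings"},
}


def resolve_forward(value: str, from_version: int, to_version: int) -> str:
    """Carry a value written under one version forward to a later version."""
    for version in range(from_version + 1, to_version + 1):
        value = CHANGE_LOG.get(version, {}).get(value, value)
    return value


print(resolve_forward("Acme Corp", from_version=1, to_version=3))  # Acme Holdings
```

A consumer pinned to version 2 would call `resolve_forward("Acme Corp", 1, 2)` and keep seeing “Acme Corporation”, so its historical analysis stays interpretable.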