
Where dimensions come from and where they break.

Page 3 of 7 | Module: Foundations | Reading time: ~7 min

Dimensional values do not start in dashboards. They start in operational systems where people are doing other work, and they travel through pipelines and integrations before arriving in the places where they are eventually analysed. Understanding the journey is the second step toward understanding why dimensions go wrong; the first step (the five forms of variance) was about the shape of the failure, and this page is about the route by which the failure is produced.

The point of capture.

Dimensional values originate in operational systems. The CRM records a customer's industry when a sales rep creates the account. The ERP records a vendor's category when procurement onboards the supplier. The ticketing platform records a support category when an agent classifies an incoming issue. The marketing automation platform records a campaign tag when a marketer sets up the send.

In every case, the system that captures the value is not optimised for the dimensional field. The CRM is optimised for sales workflow; the dimensional fields exist to be filterable later. The ERP is optimised for transactional throughput; the categorical attributes are secondary. The ticketing platform is optimised for resolution speed; classification is the cost the agent pays to do their primary job.

This means the field where the value is entered is typically minimally constrained: a free-text field where a dropdown would be more appropriate, a picklist that has not been updated in two years, an “Other” bucket with a text field beside it that absorbs everything that did not fit. The capture point is where most variance originates, and the variance is the predictable result of the operational system optimising for something other than dimensional consistency.
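As an illustration of what a constrained capture point could look like, here is a minimal sketch of validating an entered value against a controlled vocabulary. The vocabulary `ALLOWED_INDUSTRIES` and the function `validate_industry` are hypothetical names invented for this example; a real CRM would enforce this through its own picklist or validation-rule mechanism.

```python
# Hypothetical controlled vocabulary for an "industry" field (assumption).
ALLOWED_INDUSTRIES = {"Healthcare", "Manufacturing", "Retail", "Software"}


def validate_industry(raw: str) -> tuple[bool, str]:
    """Return (is_valid, value).

    Matching ignores case and surrounding or repeated whitespace. A valid
    entry is returned in its canonical spelling; an invalid one is returned
    cleaned but otherwise unchanged so it can be reviewed rather than lost.
    """
    cleaned = " ".join(raw.split())
    by_folded = {v.casefold(): v for v in ALLOWED_INDUSTRIES}
    canonical = by_folded.get(cleaned.casefold())
    if canonical is not None:
        return True, canonical
    return False, cleaned


print(validate_industry("  healthcare "))  # (True, 'Healthcare')
print(validate_industry("Health care"))    # (False, 'Health care')
```

The second call shows the limit of this kind of check: it catches surface differences in casing and spacing, but a plausible variant spelling still needs a human decision or an alias table.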

The transit.

Once captured, the value moves. It moves through ETL pipelines into warehouses. It moves through APIs between systems. It moves through manual exports into spreadsheets. Each transit is an opportunity for further degradation.

A pipeline might truncate values that exceed a target schema's field length. It might lowercase everything for consistency, destroying meaningful casing distinctions. It might fail to handle accented characters when converting between character encodings. It might silently drop records where a field does not match a hardcoded mapping table that has gone stale.
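The stale-mapping case can be sketched in a few lines. The example below assumes a hypothetical mapping table `CATEGORY_MAP` and a hypothetical record shape; instead of dropping records whose category is missing from the table, it keeps them under a placeholder and counts what it could not map, so the gap is visible rather than silent.

```python
from collections import Counter

# Hypothetical mapping from source codes to warehouse categories (assumption).
CATEGORY_MAP = {"HW": "Hardware", "SW": "Software", "SVC": "Services"}


def map_categories(records):
    """Map each record's category, keeping unmapped records instead of dropping them.

    Returns the mapped records plus a Counter of source values that had no
    entry in CATEGORY_MAP, which can be logged or reviewed.
    """
    mapped = []
    unmapped = Counter()
    for rec in records:
        source = rec["category"]
        target = CATEGORY_MAP.get(source)
        if target is None:
            unmapped[source] += 1
            target = "UNMAPPED"  # placeholder keeps the row in downstream totals
        mapped.append({**rec, "category": target})
    return mapped, unmapped


records = [
    {"id": 1, "category": "HW"},
    {"id": 2, "category": "CLOUD"},  # added at the source after the table was written
    {"id": 3, "category": "CLOUD"},
]
mapped, unmapped = map_categories(records)
print(len(mapped))     # 3
print(dict(unmapped))  # {'CLOUD': 2}
```

Whether an unmapped record should be kept under a placeholder, quarantined, or should fail the run is a design choice; the point of the sketch is only that the pipeline reports the mismatch.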

API integrations have their own failure modes. The integration script captures fifteen of the twenty fields the source API returns; the five it does not capture include dimensional information that now has nowhere to land in the target system. A third-party API changes its schema without notice; the integration starts producing values in a different format alongside the old format, and nobody is monitoring the difference.
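One way to make this kind of loss visible is to compare what the source returns with what the integration actually copies. The sketch below assumes a hypothetical set `CAPTURED_FIELDS` and an invented payload; it does not reflect any particular API.

```python
# Hypothetical: the fields an integration script copies to the target (assumption).
CAPTURED_FIELDS = {"id", "name", "owner", "created_at"}


def uncaptured_fields(payload: dict) -> set:
    """Fields present in the source payload that the integration ignores."""
    return set(payload) - CAPTURED_FIELDS


def missing_fields(payload: dict) -> set:
    """Fields the integration expects that the source no longer returns."""
    return CAPTURED_FIELDS - set(payload)


payload = {
    "id": 1,
    "name": "Acme",
    "owner": "kim",
    "created_at": "2024-01-01",
    "industry": "Retail",  # dimensional field with nowhere to land
}
print(uncaptured_fields(payload))  # {'industry'}
print(missing_fields(payload))     # set()
```

Running a comparison like this on a schedule would also surface an unannounced schema change, since renamed or added fields show up in one of the two sets.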

Manual transit (exports, copy-paste, spreadsheet pipelines) is the most error-prone route. It introduces encoding artefacts, formatting shifts, transposition errors, and version-control failures. Every iteration introduces some degradation, and the degradation accumulates because each pass starts from the output of the previous pass.

The consumption.

The values eventually arrive in analytical systems: warehouses, BI tools, dashboards, ML pipelines. At this point, the dimensional values are no longer just things the operational system happened to store. They are the axes along which the business is measured, modelled, and decided.

A revenue total is grouped by department. A churn rate is calculated by segment. A model is trained with industry as a feature. The value is doing structural work: it determines what gets summed together, what gets filtered out, what relationships the model can learn. The accuracy of every downstream conclusion depends on the consistency of the categorical surface that the analysis is grouping by.
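A small worked example makes the structural role concrete. The rows and the alias table `ALIASES` below are invented for illustration; the sketch totals the same revenue rows twice, once by the raw department value and once after resolving variants to a single label.

```python
from collections import defaultdict

# Invented revenue rows: (department as entered, amount).
rows = [("Engineering", 100), ("engineering", 40), ("Eng.", 25), ("Sales", 80)]

# Hypothetical alias table resolving variants to one canonical label (assumption).
ALIASES = {"engineering": "Engineering", "eng.": "Engineering", "sales": "Sales"}


def total_by(rows, key=lambda dept: dept):
    """Sum amounts per group, where key() decides which rows share a group."""
    totals = defaultdict(int)
    for dept, amount in rows:
        totals[key(dept)] += amount
    return dict(totals)


raw = total_by(rows)
resolved = total_by(rows, key=lambda dept: ALIASES.get(dept.casefold(), dept))

print(raw)       # {'Engineering': 100, 'engineering': 40, 'Eng.': 25, 'Sales': 80}
print(resolved)  # {'Engineering': 165, 'Sales': 80}
```

In the raw grouping, Engineering appears to earn 100 and to rank only slightly above Sales; once the three spellings are resolved, its total is 165. Nothing about the amounts changed, only the categorical surface they were grouped by.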

The consumer rarely has control over the values they receive. They cannot easily push corrections upstream to the operational systems that captured the values, because those systems are owned by other teams with different priorities. The consumer's options are: clean the values locally (which works for the current analysis but does not fix the source), accept the inconsistency and document it (which works for one report but does not scale), or do nothing (which is what most analyses end up doing because the alternatives are unsustainable).

Why the journey matters.

Each stage of the journey produces some specific subset of the failure modes from the previous page. The capture point is where surface, semantic, and definitional variance most often originate. The transit is where granularity mismatch and structural contamination accumulate. The consumption point is where temporal variance becomes visible, because the analysis spans the time periods over which the dimension has drifted.

The implication for management: cleaning the data at the consumption point treats the symptom. Fixing the capture point and the transit treats the cause. The infrastructure described by the target state operates across all three stages: validating at capture, propagating through transit, and resolving consistently at consumption.

The detailed taxonomy of where things break is in Layer 2 of the problem catalogue, which enumerates nine specific process failure modes. This page is the high-level map; the catalogue is the full reference.

GOING DEEPER