cleandimslearnfoundations1. What is a dimension

What is a dimension?

Every dataset that an organisation works with contains two structurally different kinds of values. The distinction is fundamental to understanding what dimensions are.

PAGE1 of 7MODULEFoundationsREADING TIME~ 5 min

The first kind is quantitative. Numbers that measure something. Revenue is a quantitative value. Headcount is a quantitative value. Latency, page views, account balance, days to close. These values can be added together. They can be averaged. They can be ranked. The information they carry is in the size of the number.

The second kind is categorical. Labels that classify something. Country is a categorical value. Vendor is a categorical value. Job title, deal stage, support ticket type, customer segment. These values cannot be added together in any meaningful way. The information they carry is in the identity of the value: what the label refers to, and which other records share the same label.

A revenue figure is a quantitative value. The country, segment, and product that revenue is associated with are categorical values. The convention in data work is to call the quantitative values “measures” and the categorical values “dimensions.” The terms come from data warehousing, where measures sit in fact tables and dimensions sit in dimension tables that join to the facts.

For the rest of this curriculum, the term used is dimensions. Other vocabularies exist. Some practitioners say “categorical data.” Some say “reference data.” Some say “controlled vocabulary.” The terms carry slightly different connotations but refer to the same underlying class of values.

MEASURESQuantitative
  • Revenue
  • Headcount
  • Latency
  • Page views
  • Account balance
  • Days to close
+ sumx̄ average↕ rank
DIMENSIONSCategorical
  • Country
  • Vendor
  • Job title
  • Deal stage
  • Support ticket type
  • Customer segment
⤓ group⏚ filter⨝ join
The two kinds of values, the operations meaningful on each.

Why this distinction matters.

The reason measures and dimensions deserve to be separated is that they behave differently under nearly every operation that matters in analysis.

Measures support arithmetic. Two revenue figures can be summed to get a total. A column of latency measurements can be averaged. A series of account balances can be ranked from lowest to highest. The operations are meaningful because the numbers have magnitude.

Dimensions do not support arithmetic. There is no meaningful way to sum two values of the Country dimension. There is no average of three values of the Vendor dimension. The operations that work on dimensions are different: grouping (count of records by country), filtering (records where vendor is “AWS”), and joining (records linked through a shared customer segment).

A second difference: measures are usually generated by systems automatically (the system records the transaction amount, the timestamp, the page view count) while dimensions are usually entered or assigned by people or by classifying systems. The amount on an invoice is calculated; the vendor on the invoice is entered or selected. This difference is the seed of every dimensional data problem the rest of this curriculum will describe.

A quick test.

Look at any table in any system you work with. For each column, ask: can I add two values of this column and get something meaningful? If yes, the column is a measure. If no, the column is a dimension (or an identifier, or a date, or unstructured text, but most often a dimension in a business context).

A typical operational record will have a few measures (amount, quantity, duration) and many dimensions (vendor, category, status, region, segment, owner, type). The dimensions outnumber the measures, usually by a substantial ratio. This is not accidental. Dimensions are the structure that gives the measures their meaning.

GOING DEEPER