cleandimsopen sourcelibraryWhat Kind of Data Is Dimensional

What kind of data is dimensional.

The five primary categories of data found in business systems. Where dimensional treatment applies, and where it does not.

BYCleanDimsPUBLISHEDREADING~ 6 minVERSIONv 1.0

The primer establishes that dimensional data is a coherent class distinguishable from numeric, identifier, temporal, and free-text data. This article goes one step further: it walks through the full landscape of data types found in business systems and identifies, for each, whether and where dimensional treatment applies. The reader who finishes this article should be able to look at any field in any system and answer the question, is this dimensional, and if so, what kind of dimensional?

The boundaries are not always sharp. Some fields are mixed, some are dimensional in one use and identifier in another, some are temporal but accessed as dimensional. The article describes the categories and the edge cases; the judgement about which applies to a specific field in a specific system is left to the practitioner who knows that field.

The five primary categories.

Most data in business systems falls into one of five categories. The categories are distinguishable by what kind of information the value carries and how the value behaves under aggregation and grouping.

Numeric data. Values that measure magnitude. Revenue, headcount, latency, page views, temperature, account balance. Numeric data supports arithmetic; summing and averaging produce meaningful results. The information is in the size of the number.

Identifier data. Values that uniquely refer to an entity. Customer ID, order number, employee ID, account UUID, transaction reference. Identifiers do not support arithmetic; they support lookup and reference. The information is in the identity of the entity referenced.

Temporal data. Values that locate an event in time. Order date, event timestamp, contract start date, last-modified time. Temporal data supports duration and interval arithmetic, ordering, and grouping into time buckets. The information is in the position of the event in time.

Free-text data. Values that contain unstructured prose. Description fields, notes, comments, addresses (in some treatments), customer feedback, support ticket bodies. Free-text data does not support direct grouping or aggregation; it requires natural language processing or structured extraction before it can be analysed in those ways.

Dimensional data. Values that classify or categorise. Country, segment, vendor, status, deal stage, product category. Dimensional data does not support arithmetic; it supports grouping, filtering, and joining. The information is in the identity of the category and in the population of records that share it.

These five categories are not exhaustive but they cover the substantial majority of fields in operational and analytical systems. The categories are also not mutually exclusive in practice; some fields exhibit characteristics of more than one. The hybrid cases are useful to study because they are where most confusion arises.

Dimensional data subdivided.

Dimensional data is the focus of this article, and it has sub-categories worth naming explicitly. The same management discipline applies across all of them, but the specific challenges differ.

Controlled-vocabulary dimensional data. Fields that draw from a finite set of accepted values, where the values are stable and the set is enforced. Currency codes. Country codes. Severity tiers. These are dimensional in nature and well-behaved in practice; the management problem is small because the controlled vocabulary is doing the work that the canonical reference would otherwise do.

Uncontrolled-vocabulary dimensional data. Fields that are conceptually dimensional (used for grouping and filtering) but stored as free text. Vendor names, internal product categories, customer segments, job titles, campaign tags. This is the category where most dimensional data problems live, because the lack of enforcement at the point of capture allows variance to accumulate. The canonical reference exists to provide, externally, what the source system failed to provide internally.

Hierarchical dimensional data. Fields where values have parent-child relationships. Industry (Technology > Software > Enterprise Software). Geography (city > state > country > region). Organisational structure (team > department > function). Hierarchy adds structural requirements that flat dimensions do not have, primarily around enforcing valid combinations across the levels.

Composite dimensional data. Fields that combine multiple dimensions into a single value, usually for convenience. UTM tags (source + medium + campaign + content). Cost-centre codes (region + function + project). Asset tags (location + type + vintage). Composite dimensions look like single fields but are actually several dimensions concatenated; the management challenge is decomposing them into their component dimensions before normal dimension management applies.

Hybrid and ambiguous cases.

Several common fields straddle the boundaries between categories and deserve specific attention.

Identifiers that look dimensional. A customer ID is, strictly, an identifier. It uniquely refers to a specific customer. It is not classifying that customer into a category. But in practice, “customer ID” is often used as a grouping key in reports (“revenue by customer”), at which point it behaves dimensionally. The treatment depends on the use: as a join key, it is an identifier; as a grouping field, it is a high-cardinality dimension. Most warehouses model it as both, with the identifier role in the fact table and the dimension role in a customer dimension table.

Dimensions that look temporal. Fiscal quarter, fiscal year, month-of-year, day-of-week. These have temporal information embedded in them but are typically used as dimensional grouping fields in reports. The management approach is closer to a dimension than to a date: there is a canonical set of values (twelve months, four quarters, etc.), and the reference can be enumerated.

Temporal data that should be dimensional. Specifically, named time periods. “FY24 Q3.” “Black Friday week.” “The 2024 launch window.” These are time periods with names, used as dimensions in analysis. The canonical reference holds the named periods and their date ranges; the dimension surfaces the names; the temporal arithmetic is downstream.

Free-text that contains dimensional data. A description field where the structured information (competitor name, loss reason, region) has been entered as part of unstructured prose because no proper field existed. The data is conceptually dimensional but stored as free text. The management approach involves either extracting the dimensional content into proper fields (the structural fix) or treating the free-text field as a dimensional source via extraction (the workaround).

Numeric data that should be dimensional. Specifically, codes that look like numbers. Account codes, cost centre codes, product SKUs that are numeric. These are dimensional in function (they classify) but stored as numbers. The handling depends on whether the numeric form has arithmetic meaning (in which case it is numeric) or is purely an opaque identifier (in which case it is dimensional, and treating it as numeric will introduce subtle bugs).

How to decide for a specific field.

The diagnostic for any field in any system is straightforward.

Does the field support meaningful arithmetic? If sums, averages, and ranges produce useful results, the field is numeric. If they do not, the field is one of the other four categories.

Is the field used to uniquely identify an entity? If two records with the same value refer to the same real-world thing, the field is an identifier. If two records with the same value refer to two things that are the same in category but not the same entity, the field is dimensional.

Is the field a date, timestamp, or other position-in-time? If so, it is temporal, with the caveat that named periods derived from temporal data may be dimensional in their primary use.

Is the field unstructured prose? If so, it is free text, with the caveat that structured dimensional content may be embedded inside and may need extraction.

If none of the above apply, the field is dimensional. The remaining question is what kind of dimensional, which depends on whether the vocabulary is controlled, whether values have hierarchy, and whether multiple dimensions are composited into a single value.

Why the distinction matters for management.

Each of the five primary categories needs different treatment under data management. Numeric data needs validation against expected ranges and distribution monitoring. Identifier data needs uniqueness enforcement and referential integrity. Temporal data needs format normalisation and timezone handling. Free text needs structured extraction where relevant. Dimensional data needs a canonical reference, named ownership, an intake process, propagation, and the runtime layer that the target state describes.

A common error in data quality programmes is treating all the categories under one umbrella, with the same tools and the same processes. The umbrella works for the simple cases (validation, completeness, uniqueness) and fails for the cases that require category-specific treatment. Dimensional data is the category that suffers most from the one-size-fits-all approach, because the tools that work on the other categories explicitly do not address the semantic equivalence problem that dimensional data needs solved.

Naming the categories is the first step in giving each one the treatment it needs.