Maintain Context of Changes in Data

The basis for the emergence of slowly changing dimensions was the model - the star schema, which was presented in 1996 by Ralph Kimball in his book “The Data Warehouse Toolkit”. The new modeling technique proposed by Kimball allowed to reduce the amount of data stored in the data warehouse, as well as improve query performance.

In a star schema, data is stored in fact tables and dimension tables . Fact tables contain data that records events, such as transactions at a retail store, a guest’s hotel reservation, or a patient’s visit to a doctor. Dimension tables store information that enriches the data in the fact tables.

Examples:

Fact	Measurement
Retail Store Transactions	Detailed information about each item sold in this store
Booking a guest at a hotel	Floor, number of beds, bathrooms for all rooms in the hotel
Patient visits to the doctor	Patient information such as address and telephone number

While fact tables are constantly changing to reflect business events, dimension tables are not as frequently changed. But what happens when they do change? What if a dimension changes multiple times? How is historical data preserved?

This is where Slowly Changing Dimensions (SCD) come to the rescue .

Example - Sales Data Model

A sales data model can be created using a star schema to track sales metrics such as revenue, units sold, and profit. The fact table will contain the sales data, and the dimension tables will provide context such as by product, customer, time, and geography.

Star schema

More details:

Fact Table: Sales (Columns: SaleID, ProductID, DateID, StoreID, CustomerID, StateID, Revenue)
Measurement tables:
- Tables related to products:
  - Product (Columns: ProductID, ProductName, CategoryID, BrandID, Price)
  - Category (Columns: CategoryID, CategoryName, ParentCategoryID)
  - Brand (Columns: BrandID, BrandName, ManufacturerID)
  - Manufacturer (Columns: ManufacturerID, ManufacturerName, Country)
- Tables related to dates:
  - Date (Columns: DateID, Date, DayOfWeek, MonthID)
  - Month (Columns: MonthID, MonthName, Quarter, Year)
- Tables related to the store:
  - Store (Columns: StoreID, StoreName, LocationID, ManagerID, OpeningDate)
  - Location (Columns: LocationID, City, StateID, PostalCode)
  - Manager (Columns: ManagerID, ManagerName, ContactNumber)
- Tables related to clients:
  - Customer (Columns: CustomerID, CustomerName, Gender, AgeGroup, MembershipTypeID)
  - MembershipType (Columns: MembershipTypeID, MembershipLevel, DiscountRate)
- Tables related to state:
  - State (Columns: StateID, StateName, CountryID, Region)
  - Country (Columns: CountryID, CountryName, Continent)

Slowly Changing Dimensions (SCD)

This is an approach to updating and maintaining data stored in dimension tables as it changes. There are several different methods for handling changing dimensions, and these methods are commonly referred to as SCD “types.”

Traditionally, data architects and data engineers work hand in hand to plan, develop, and maintain a data model that uses slowly changing dimensions. Data scientists and analysts then use these fact and dimension tables that make up the star schema to train ML models, populate dashboards, or perform a variety of other data-related tasks. In a professional data environment, understanding the basics of slowly changing dimensions and how to use such data helps reduce the time to insights, improving efficiency and reliability.

SCD Type 0 (Unchanged Measurement):

For this type of data, there are no changes. The data retains its identity throughout its entire existence. Such data may include: birthdays, city of birth, postal codes, INN, OGRN, SNILS, codes of regions of Russia, historical events (for example, the date of the city’s foundation), mathematical constants (for example, the number π (pi) or the number e), the currency code RUB, etc.

SCD Type 1 (Rewrite):

With SCD Type 1 , if a record in a dimension table is changed, the existing record is updated or overwritten. Otherwise, a new record is inserted into the dimension table. This means that records in the dimension table always reflect the current state, and historical data is not retained.

For example, a table that stores information about products sold in a grocery store might handle record changes using SCD Type 1. If a record for a particular product already exists in the table, it will be updated with the new information.

Example: If I move potatoes from aisle 11 to aisle 6, then in SCD Type 1 the entry with potatoes will simply be overwritten with the new data:

SCD Type 1 (Rewrite)

SCD Type 1 ensures that there are no duplicate records in the table and that the data reflects the most current state of the dimensions. This is especially useful for real-time (e.g. dashboards) and predictive modeling, where only the current state is important.

However, because the table only stores the most recent information, data analysts cannot compare changes in measurements over time. For example, a data analyst would not be able to determine how revenue from potato sales changed after they were moved to aisle six without additional information.

SCD Type 1 simplifies reporting and analytics for current status, but has limitations when performing historical analysis.

SCD Type 2 (Add New Line):

While a table that only reflects the current state can be useful, there are times when it is necessary to track historical changes to dimensions. In SCD Type 2, historical data is maintained by adding a new dimension change control row, marking the new row as current (True = 1) and marking the previous record as historical (False = 0).

Example:

Continuing to use the grocery store table, we’ll add an additional column , is_current , which stores a boolean value: true if the record reflects the current value, and false otherwise.

And if the potato, just like in the case of considering the example of SCD Type 1 , moves to the 6th trading row. When using SCD Type 2 to record this change, we will create a table that looks like this:

SCD Type 2 (Add New Line)

A new row is added to reflect the change in location of the potato, with the is_current column set to True . To preserve historical data and accurately reflect the current state, the is_current column for the previous record is set to False .

But what if you want to study how potato sales changed after changing their location? This is quite difficult to do using only one column if there are multiple historical records for a single product. Luckily, there is an easy way to do this.

To do this, we’ll take a dimension table containing the same information as before. But instead of the is_current column , we’ll add the start_date and end_date columns . These dates represent the period of time during which the dimension was current. Since the data in this table is the most recent, the end_date is set far in the future.

If the potato moved to row 6 on January 4, 2025, the updated table would look like this:

SCD Type 2 (Add New Line) v2

Notice that the end_date for the first row has been updated to the last day the potato was in aisle 11. A new record has been added, and the potato is now in aisle 6. The start_date and end_date columns help show when the change occurred and indicate which record is current.

Using this technique to implement SCD Type 2 not only preserves historical data, but also provides information about when the data has changed. This allows data analysts to study operational changes, conduct A/B testing, and make informed decisions.

SCD Type 3 (Add New Attribute):

In cases where the number of dimension changes is predetermined and unchangeable, and will generally only change once. Or when we only need to represent the latest historical record, SCD Type 3 comes in handy. Instead of updating the dimension or storing the change as a new row, SCD Type 3 uses a column to reflect the change. This is a bit complicated to explain, so let’s jump right into an example.

The table below contains information about sports teams across Russia. Here, the table contains two columns to store the current and previous stadium name. Since each of these teams uses the original stadium name, the previous_stadium_name column is filled with NULL values .

If Spartak Moscow decides to sign a new 25-year sponsorship contract, the updated table will look like this:

SCD Type 3 (Add New Attribute)

To account for the new stadium name, Otkritie Arena moves to the previous_stadium_name column , and Gazprom Arena takes its place in the current_stadium_name column . The new 25-year sponsorship contract will likely outlive the model under construction, meaning the entry is unlikely to change again.

Using SCD Type 3 makes comparing current data to historical data fairly easy. There is only one row for each command, and current and historical data are side by side in two different columns. However, this means that only one historical record can be stored for one dimension attribute, which can be limiting, especially if the data changes more frequently than expected.

SCD Type 4 (History Table):

The fourth type is more similar to the second type, where we also stored the history of dimensions within a single table, adding another version of the row to the dimension. However, if changes occur quickly, SCD type 2 will generate many unnecessary records, creating difficulties for scaling. This method is similar to the change data collection methods and database audit tables.

Using the same grocery store table as an example, to process record changes using SCD Type 4. We will create an additional table that will receive all changes. And the main table will contain strictly current data:

CD Type 4 (History Table)

In this method, the rapidly changing column is removed from the dimension and transferred to a new dimension table. As you can see in the history table, we record the fact of changes, with the old and new price of the product, as well as the date of the change:

The above implementation of slowly changing type 4 dimensions in a data warehouse allows you to eliminate unnecessary volume in the main dimension. And you can still perform the analysis you need.

SCD Type 5 (Combined Approach):

A hybrid approach that combines SCD Type 4 using a separate table to store historical changes and SCD Type 1 to maintain supporting data.

For an example of SCD Type 5, let’s consider a variant with a grocery store table in which we will reflect two types:

SCD Type 1 for tracking price changes.
SCD Type 4 for storing historical information.

Continuing to work on the same grocery store table, let’s say we want not only to maintain a table with history, but also to immediately see how many times the price for a given product has changed. To do this, we will add an additional column to the fact table displaying the current version of the price, which will be rewritten each time it changes and is transferred to the history table:

SCD Type 5 (Combined Approach)

SCD Type 6 (Combined Approach):

This is a hybrid approach that combines all three main SCD methods, hence the name 1+2+3 = 6. Type 6 is especially useful if you want to keep the full history, as with Type 2, but also have an easy way to influence the current version, as with Type 3.

As an example of SCD Type 6, let’s consider the option with a table of company employees in which we will reflect three types:

SCD Type 2 for tracking job changes.
SCD Type 3 for storing the current section.
SCD Type 1 for updating the current department.

And first, we change the position of employee Ivan from “Manager” to “Senior Manager”. Since we are using SCD Type 2, a new row is created with the updated position, and the previous position is saved in the “PreviousPosition” column. After that, we change the department from “Marketing” to “Sales”. And since this is SCD Type 1, the “CurrentDepartment” column is updated in the current row.

SCD Type 6 (Combined Approach)

This approach allows the company to effectively manage changes in employee data while maintaining both historical and current information.

Data warehouse teams often need to preserve historical attributes and also support the ability to provide historical performance data based on current attribute values. SCD Type 6 is the solution for these challenges.

If you want to see historical facts based on current attribute values, we filter or summarize the data by current attributes. If we summarize the data by a historical attribute, we will see facts as a set at a certain point in time.

Limitations of Slowly Changing Dimensions in Data Science

Although SCDs are an important part of data warehousing, allowing historical data to be stored along with changes, they have some limitations, such as:

Storage space : A dataset with a large number of input features can quickly increase storage requirements. This is especially true for Type 2 SCD, which involves adding a new row for each change.
Maintenance : As the number of attributes or dimensions increases, tracking changes becomes more difficult. This can lead to an increased likelihood of errors.
Performance : As the amount of data needed to track historical changes grows, the performance of machine learning models may degrade.
Scalability : As the volume of data and accumulated changes increases, some types of SCD may not scale effectively. This impacts storage, computation, and visualization.
Redundancy : Most slowly changing dimensions in Data Science involve redundancy. For example, in Type 2 SCD, a new row is added for each change in the dimension. This eventually results in multiple rows for the same entity and increased dimensionality. Dimensionality reduction techniques help identify and manage such redundancy.

How to implement Slowly Changing Dimensions in a data warehouse?

Ideally, you should consider slowly changing dimensions (SCD) at the database design stage. If you are just starting to build your data infrastructure or are working in a startup, this will be the easiest.

For existing databases, start by assessing the data that is already in your database. Document the different types of dimensions and their relationships. If there are type 0 dimensions in the database that shouldn’t be type 0, start with those. And if necessary, add historical data tracking as soon as possible.

Next, determine which SCD types (2, 3, or 4) are right for your business. Ask yourself the following questions:

How often do these measurements change?
Will you use a timestamp column or a flag column?
Is it worth adding a history table to store a large number of records?

Such a process will require the involvement of analysts, architects, developers and data scientists.

Another important point is how you want to handle historical records that have not been tracked before. You can, of course, discard historical data and implement SCD only for future operations, but the best option is to collect old records and create your own version of a column with a timestamp or flag.