Slowly Changing Dimension
What exactly are Slowly Changing Dimensions, and why should you care about them? Many people find themselves asking this question. The term “Slowly Changing Dimension” is kind of odd. I mean, aren’t all Dimensions changing? And why does the speed at which the change matter? Never fear; this article will answer your questions about Slowly Changing Dimensions (SCD), the different SCD Types and how to create them. I even provide example code, which you can test out for yourself to get a deeper understanding.
If you need an introduction to Dimension Tables in general, read our Dimension Table – Introduction article.
What is a Slowly Changing Dimension?
The term “Slowly Changing Dimension” (SCD for short) acknowledges that Dimension Members change, but only slowly, over time. However, most Dimensions are, in fact, slowly changing, so why is the concept of “Slowly Changing Dimensions” so important in Data Warehousing?
The answer is that there are several different “Types” of “Slowly Changing Dimensions,” and each one has a different impact on what data users see in reports and analyses. Each SCD Type represents a different change-handling strategy used to manage changes to Dimension data. Whether users want to report using historical or latest values and how they want to view history dictates the SCD Type you choose to handle changes to Dimension attributes. Therefore, it’s very important you understand the different types of Slowly Changing Dimensions and their impact when designing a Data Warehouse.
Where did the concept of “Slowly Changing Dimensions” come from? Ralph Kimball originally identified the concept to address the requirement of a Data Warehouse to accurately support historical reporting. He defined the 6 now industry-standard change-handling strategies or “Types”. The Types include Types 1, 2, and 3, along with a couple of hybrid techniques, Types 4 and 6.
Although we talk about Slowly Changing “Dimensions”, what we actually implement is Slowly Changing “Attributes”. The SCD change-handling strategies are implemented per attribute. That means different attributes within one Dimension can have different “Slowly Changing Dimension” Types. Developers often think of a whole Dimension as being Type 1, 2, etc. That is a mistake. A Dimension can have Type 1, 2, 3 etc attributes at the same time. The ETL patterns we describe in this post show how to handle multiple SCD Types in one Dimension.
So, to sum up, the term “Slowly Changing Dimensions” is kind of confusing. Why is it important to call out the slowly changing nature of Dimensions if almost all Dimensions are slowly changing? The reason is the different ways of handling slowly changing “attributes” and the impact they have on reporting.
The following sections discuss the SCD Types in detail and describe how to write ETL code to implement the SCD Types.
SHORTCUT: We’ve developed an SQL script to implement the SCD Type 1 and 2 pattern. Why not test it out for yourself and get a copy of the code.
Types of Slowly Changing Dimensions
People generally agree there are two common SCD Types, 1 and 2, and some less-used SCD types, 3, 4 and 6. The list of Slowly changing dimension types follows:
- SCD Type 1 – Don’t keep history, i.e. overwrite.
- SCD Type 2 – Keep a version of the dimension member for every change.
- SCD Type 3 – Keep the latest and prior versions (but not every version).
- SCD Type 4 – Current and History Tables.
- SCD Type 6 – A hybrid combination of Type 1, 2 and 3.
We use the following example to describe each of the Slowly Changing Dimension Types.
The star schema below depicts a typical Sales scenario. Employees sell Products to Customers. The Sales organization is divided into Divisions. Each Sale is a business event and, therefore, can be modelled as a “Sales” Fact. The diagram below depicts the Sales Fact and its related Dimensions. The Dimensions provide the Who (Customer, Employee), the where (Division) and the when (Calendar, Time of Day) context for the Sales Fact.
Slowly Changing Dimension Type 1 – Overwrite
The Slowly Changing Dimension (SCD) Type 1 change-handling strategy is used when it doesn’t make sense to keep an attribute’s history or when users aren’t interested in it.
When an attribute’s value changes, the SCD Type 1 code simply overwrites the existing value with the new value. Essentially, it records no history of the change.
Our example has an Employee Dimension associated with a Sales fact. In the Employee Dimension, an Employee Surname is a good example of a Type 1 SCD Attribute. If an Employee’s Surname changes due to marriage, you want to overwrite the Surname with the new value.
Since there is only one version of each Employee in the Dimension, all historical Sales Fact records remain associated with the single Employee record, and all future Sales Fact records are associated with the same Employee record.
Now imagine a report that shows Sales by Employee Name. All Sales are reported against the Employees new name. A desirable outcome.
If you kept multiple versions of the Employee (e.g., the SCD Type 2 strategy) and looked at the Sales by Employee Name report, the Employee would show up twice, once with the new name and once with the old name. This is an undesirable outcome.
SCD Type 1 is the most used strategy for handling changes, often serving as developers’ go-to method by default (for better or worse). It’s also the easiest to implement. Hmmm…
Slowly Changing Dimension Type 2 – Add a new Version
You use the Slowly Changing Dimension Type 2 strategy to ensure that reports present a historically accurate data view. The SCD Type 2 change strategy is designed to keep a history of changes to Type 2 attributes. When a Type 2 attribute value changes, the ETL code pattern I describe later creates a new row. The new row represents a new version of the given Dimension Member. Therefore, there can be more than 1 row for each Dimension Member. In order for reports to present history accurately, The Dimension load pattern must work in tandem with associated Fact table‘s load Pattern. I discuss that further later.
Take the SCD Type 2 example below, which shows a Sales Division Dimension. The example treats the Region attribute as a Type 2 attribute. In the example, The Sales dept decided to move the “Richmond” Division from the “North East” Region to the “East Coast” Region on September the 1st, 2024. As the Region attribute is an SCD Type 2 attribute, this change will result in the creation of a new version row for the “Richmond” Division. After September the 1st, 2024, there will be 2 rows in the Dimension for the same “Richmond” Division Dimension Member, one with the old Region and one with the new Region.
Note that the system has assigned the new record a new surrogate key (45). Also, note that the system has tagged the new record as the “latest” version and set the “Row Effective Date” to the date of the change. The system has also set the “Row End Date” to the date of the change for the superseded version row ( division skey = 1). The How to Create a Slowly Changing Dimension section provides a detailed discussion of this pattern.
As I mentioned earlier the SCD Type 2 pattern needs to work in tandem with the Fact load pattern to accurately present history. To determine which version of the Dimension member to associate with any given Fact, the Fact table load pattern should use a combination of the business key and effective dates. Generally, a Fact will contain some transaction date. The Fact finds the correct Dimension Member version using the business key, where the Fact transaction date is between the Dimensions Effective and End dates.
Using this logic, all existing Sales Facts prior to September 1st, 2024, remain associated with the original version of the Richmond Division, and all Sales Facts after that date are associated with the new version. If you were to look at a Sales by Region report, Sales prior to September 1st, 2024 appear in the “North Coast” Region, and Sales after in the “East Coast” Region.
Slowly Changing Dimension Type 3 – Latest and Prior Version
With the Slowly Changing Dimension Type 3 change-handling strategy, you only capture the latest and prior values instead of capturing every version of the attribute value, as you do with SCD Type 2. There are some rare circumstances where this is desirable. Take the previous Type 2 SCD example of a Division shifting Regions. What if the business wanted to see today’s Sales attributed to the new Region but also “as if” they had remained in the old Region, just to compare the performance of the old Region to the new one? The SCD Type 2 strategy does not accommodate this kind of request as it never associates new sales with the old Region. A pattern that associates a Fact with both the Old and New Region is necessary, thus the need for the SCD Type 3 pattern.
SCD type 3 satisfies this comparative kind of reporting requirement. In the SCD Type 3 strategy, you don’t create a new row for a new version of the Dimension member. Instead, the Dimension contains two columns. One for the current value and one for the prior value of the same attribute.
For example:
Revisiting our Richmond Division example, the Richmond Division was in the “North” East Region before the change. The Dimension only contained a single column for the Region attribute:
To accommodate the Type 3 change, you add a “prior Region” column to the table. When the Richmond Division moves to the ‘East Coast’ Region, the system transfers the ‘North East’ value to the ‘prior Region’ column and records ‘East Coast’ as the new value in the Region column. See the result below.
Now, the single version of the Dimension member is associated with all Sales Facts, both new and old. Analysts have the option to select the current Region, Prior Region, or both in their reports. This selection enables them to compare sales between the current and prior Regional organizations.
Note that we don’t recommend implementing the Type 3 ETL as a separate pattern. Instead, we recommend emulating the Type 3 pattern. Read more about our recommended approach to Slowly Changing Dimensions and implementing SCD Type 3.
Slowly Changing Dimension (SCD) Type 4 – Add a history table
Slowly Changing Dimension Type 4 is similar to SCD Type 2, except it stores the historical versions of Dimension Members in a separate table.
The Dimension is divided into two separate current and history tables. The current table contains only the latest version of each Dimension Member, while the history table contains both the historical and current versions. Both tables share the same surrogate keys.
SCD Type 4 proves useful when users want to analyze one Fact table that accurately reflects history and another Fact table where history is irrelevant. In such a case, the historical Dimension table is associated with the ‘historical’ Fact table, and the ‘current’ Fact table uses the current Dimension table.
SCD Type 4 can also be used to report both a historical and current perspective on the same Fact. In this instance, the Fact is joined to both the Historical and Current Dimensions with 2 separate foreign keys. The outcome is similar to the SCD Type 6 approach below, except there is both a Historical and Current Dimension instead of one Dimension with Historical and Current Attributes.
I don’t recommend Type 4. The ETL to manage 2 tables is more difficult, and there are issues in reload scenarios, etc. Instead, I recommend adding a ‘latest’ management column to your Type 2 dimensions and using that column when loading Fact tables to associate them with either the historical or current version. See the Implementing SCD Type 4 below.
Slowly Changing Dimension (SCD) Type 6 Hybrid – Latest vs Historical Versions
The Slowly Changing Dimension Type 6 approach is useful for enabling accurate history reporting while also supporting the ability to report historical data “as if” it is attributed to the “current” or “prior” versions of history.
Type 6 combines Types 1, 2 and 3 (1+2+3 = 6). It combines the historical versioning described in Type 2 SCD with the Current vs Prior version described in Type 3. To achieve the Type 6 pattern, you add a Type 3 “prior” and Type 1 “current” attribute to the Dimension. The original attribute contains the Type 2 version of history, and the system updates the “current” attribute using a Type 1 SCD change strategy and updates the “prior” attribute using a Type 3 change-handling strategy.
For example, the following images show the Richmond Division, originally belonging to the “Northeast” Region. When the Division moves to the “Mid Coast”, the system creates a new version of the Richmond Division Dimension Member. The original “Region” attribute tracks the Type 2 history of the Region attribute, while the “Current Region” tracks the “Type 1” latest value of the Region attribute. The “Prior Region ” attribute tracks the Type 3 prior value of the Region attribute for each version. Again, the process repeats when the Division moves to the “South Coast” region.
You can use the “Region” attribute to report on sales based on the accurate history of the Division’s Region. The “Current Region” attribute allows you to report on sales based on the latest Region, while the “Prior Region” attribute lets you report on sales based on the prior Regional organization.
Type 6 is powerful and provides analytical options, but users rarely employ it, mainly due to the complexity of data and schema management. Imagine having a Current and Prior attribute for every attribute. You should apply Type 6 selectively, although determining which attributes the business wants to analyze in this way can be challenging up-front. Again, our recommended approach suggests emulating Type 6 slowly changing dimensions.
How to Create a Slowly Changing Dimension
Slowly Changing Dimensions Recommended Approach
Our guiding principles are:
- KISS (Keep it Simple Stupid).
- Less is More. Implement as few ETL patterns as possible.
- Be consistent. Don’t do “special cases” or skip implementation details for “simple” Dimensions.
Our recommended approach is as follows:
- Only implement Type 1 and Type 2 SCD in your Dimensions.
- Stick consistently to a single pattern for all your Dimensions, regardless of whether they contain only Type 1 attributes, only Type 2 attributes, or both. Consistency makes your code simpler to maintain.
- Less is more. SCD Type 3, 4, and 6 are so rare that it’s not worth maintaining a different ETL code pattern to load them. Instead, in the rare cases where they are required, you emulate these patterns. For more detail, see the sections on creating Types 3, 4, and 6.
Shortcut: We’ve developed an SQL script to implement the SCD Type 1 and 2 pattern. Why not test it out for yourself and get a copy of the code.
Slowly Changing Dimension Example
I’ve covered the basics of slowly changing dimensions. To get a better understanding of how to implement the ETL associated with each SCD Type, I’ll use the following example:
The Star Schema below depicts a typical Sales scenario. Employees sell Products to Customers. The Sales organization is divided into Divisions. Each Sale is a business event and, therefore, can be modelled as a “Sales” Fact. The diagram below depicts the Sales Fact and its related Dimensions. The Dimensions provide the who (Customer, Employee), the where (Division) and the when (Calendar, Time of Day) context for the Sales Fact.
Our example will focus on the Customer Dimension. The following sections outline the modelling and ETL patterns you need to create and manage this Dimension.
A Universal Data Model for Dimensions
As discussed previously, you should only implement Type 1 and 2 Slowly Changing Dimension change strategies in your Dimensions (and emulate Type 3,4 and 6). Sticking to our principles, we want to define a consistent data model for all our Dimensions regardless of whether they contain only Type 1 attributes, only Type 2 attributes, or both. You never know when a requirement might emerge to change an attribute from Type 1 to Type 2 or vice versa.
As discussed in “Introduction to Dimensions“, I showed the following example of the Customer Dimension:
The “Introduction to Dimensions” article describes the various types of columns in a Dimension. For the purposes of this exercise, we are mostly interested in the “row management columns.” The Dimension ETL patterns use these columns to manage appropriate Inserts and Updates, speed up ETL execution, and facilitate change logging and auditing.
The Row Management Columns are as follows:
- row_is_latest. Indicates the row version is the latest row version, 0 = row is a superceded version.
- row_is_deleted. Indicates that the source system has deleted a row with the same business key.
- row_effective_date. The start date and time (to the microsecond) that the row version becomes effective. For the first row version, row_effective_date is “low date” (i.e. 0001-01-01…). Fact table ETL patterns require Dimensions to have effective dates to associate Facts with the correct version of Dimension Members accurately.
- row_end_date. The end date and time (to the microsecond) that the row version becomes no longer effective. This is always the effective date of the next row version in the series of row versions. Therefore, row versions’ effective date and end dates are contiguous. For the last row version, row_end_date is “high date” (9999-12-31….). The BETWEEN SQL clause can’t be used to join across row effective and end dates. Instead, use a clause similar to this: X >= Row_Effective_Date and X < Row_End_Date, where X is the transaction date of the Fact.
- row_inserted_batch_id. The system creates a new batch ID for every batch execution and sets the row_inserted_batch_id to the ID of the batch that inserted the row.
- row_updated_batch_id. The system sets the row_upaated_batch_id to the last batch Id that updated the row.
- row_scd_type_1_hash. A hash of all SCD Type 1 attribute values. Used by the ETL to compare incoming rows to detect any Type 1 changes.
- row_scd_type_2_hash. A hash of all SCD Type 2 attribute values. Used by the ETL to compare incoming rows to detect any Type 2 changes.
- row_supercedes_skey Optional. Records the surrogate key of the row that this row has superseded due to a Type 2 change. Used to speed up ETL execution and provides a convenient method of viewing and auditing the history for any given Dimension Member. Since Dimensions are generally relatively small, there is no harm in including this column.
- row_last_tran_code Optional. A useful indicator that helps debugging and audit exercises. Since Dimensions are generally relatively small, there is no harm in including this column.
This seems like a lot of columns, but each has its purpose. Dimensions are generally small, so the increase in storage is negligible. I, therefore, would argue to keep all columns. However, there are options to reduce the number of columns:
- The row_supercedes_skey and row_last_tran_code are optional. Depending on how you manage ETL runs, the batch_id columns may not be needed.
- The row_is_latest and row_is_deleted are bit data types, meaning they occupy one bit of a shared byte on the record. This leaves room for 6 additional bit fields that don’t occupy any additional space. You could potentially use 2 bits for the row_last_tran_code, but all you save is one byte per row.
- The largest columns are row_scd_type_1_hash and row_scd_type_2_hash, at 20 bytes each. You could recalculate them each time the ETL runs instead of storing them. However, this means additional coding and processing overhead. I opt to store them to make the ETL simpler and faster.
The Type 1 and 2 SCD Dimension Code Pattern
The following code implements the Type 1 and 2 SCD ETL code pattern for the example Customer Dimension. The Customer Dimension contains a mix of SCD Type 1 and 2 attributes:
- The Type 1 attributes include name, language and employer.
- The Type 2 attributes include language, type, income_level, marital_status and credit_profile.
Shortcut: We’ve developed an SQL script to implement the SCD Type 1 and 2 pattern. Why not test it out for yourself and get a copy of the code.
The SQL code, originally written for a PostgreSQL database, can be easily modified to run on various platforms such as Snowflake, Databricks, Microsoft Fabric, Google Big Query, or AWS Redshift. The great thing about SQL is that it is ubiquitous across platforms, although each platform has slightly different SQL dialects.
SQL uses a series of chained common table expressions (CTE). CTEs are temporary named result data sets that only exist during the execution of a single query. They act like virtual tables that can be referenced later in other CTEs or the main query.
The advantages of this approach are:
- CTEs (Common Table Expressions) generate temporary results, such as the delta result set below, which the system evaluates only once. However, these results can serve multiple subsequent queries, thereby enhancing performance.
- PostgreSQL (and some other SQL dialects) can include INSERT, UPDATE and DELETE statements in CTEs.
- PostgreSQL also has a feature where INSERTs, UPDATEs, and DELETEs can output a result set of updated rows. This can improve performance, as the result of one update may serve as the input for another, such as the type_2_supercede CTE shown in the following code.
- You can use this single, self-contained script as a pattern for all your Dimensions.
Below is the code consisting of the following 5 chained CTEs and a final query:
- Source CTE. Composes the source data.
- Delta CTE. Determines the delta change set.
- Inserts CTE. Insert the New and Type 2 change records. Returns a list of surrogate keys of superceded type 2 rows.
- Type_2_supercede CTE. This CTE updates rows superseded by new Type 2 versions.
- Type_1_update CTE. Applies Type 1 changes.
- Soft delete final query. A “soft” delete of any rows no longer present in the source.
The SQL is explained further following the script.
/************************************************************************* DIMODELO.COM Type 1 and 2 Dimension ETL pattern **************************************************************************/ CREATE OR REPLACE PROCEDURE dim_customer_type_1_and_2_update(batch integer) LANGUAGE SQL AS $$ -- compose the "source" WITH "source" AS ( SELECT customer_id, "name", "language", "type", employer, income_level, marital_status, credit_profile, digest("type" || ',' || income_level || ',' || marital_status || ',' || credit_profile ,'sha1') as row_scd_type_2_hash, digest(name || ',' || language || ',' || employer ,'sha1') as row_scd_type_1_hash FROM stg.customers ), -- derive delta change set delta AS ( SELECT "source".*, dim.skey as dim_skey, CASE WHEN (dim.customer_id = "source".customer_id) AND dim.row_scd_type_2_hash != "source".row_scd_type_2_hash THEN '2' -- type 2 change WHEN (dim.customer_id = "source".customer_id) AND dim.row_scd_type_1_hash != "source".row_scd_type_1_hash THEN '1' -- type 1 change WHEN dim.customer_id is NULL THEN 'N' --new WHEN "source".customer_id is NULL THEN 'D' --deleted ELSE 'unknown' END AS tran_type FROM "source" FULL OUTER JOIN whs.dim_customer dim ON dim.customer_id = "source".customer_id AND dim.row_is_latest = CAST(1 AS BIT) WHERE (dim.customer_id = "source".customer_id) AND dim.row_scd_type_2_hash != "source".row_scd_type_2_hash -- type 2 changes condition OR (dim.customer_id = "source".customer_id) AND dim.row_scd_type_1_hash != "source".row_scd_type_1_hash -- type 1 changes condition OR dim.customer_id is NULL -- new records condition OR (dim.row_is_latest = CAST(1 AS BIT) AND "source".customer_id is NULL) -- deleted records condition ), -- insert new and type 2 records inserts as ( INSERT INTO whs.dim_customer (customer_id, "name", "language", "type", employer, income_level, marital_status, credit_profile, row_is_latest, row_is_deleted, row_effective_date, row_end_date, row_inserted_batch_id, row_updated_batch_id, row_scd_type_1_hash, row_scd_type_2_hash,row_supercedes_skey, row_last_tran_code) SELECT customer_id, "name", "language", "type", employer, income_level, marital_status, credit_profile, CAST(1 AS bit), CAST(0 AS BIT), CASE WHEN dim_skey IS NOT NULL THEN CURRENT_TIMESTAMP(6) ELSE '0001-01-01' END, '31-12-9999', batch, NULL, row_scd_type_1_hash,row_scd_type_2_hash, dim_skey, tran_type FROM delta WHERE delta.tran_type in ('2', 'N') RETURNING row_supercedes_skey ), -- update superceded records (by type 2 changes) type_2_supercede as ( UPDATE whs.dim_customer AS dim SET row_end_date = CURRENT_TIMESTAMP(6), row_is_latest = CAST(0 AS BIT), row_updated_batch_id = batch, row_last_tran_code = 'S' FROM inserts WHERE dim.skey = inserts.row_supercedes_skey ), -- update records with type 1 changes type_1_update as ( UPDATE whs.dim_customer AS dim SET "name" = delta."name", "language" = delta."language", employer = delta.employer, row_updated_batch_id = batch, row_scd_type_1_hash = delta.row_scd_type_1_hash, row_last_tran_code = tran_type FROM delta WHERE delta.tran_type = '1' AND dim.customer_id = delta.customer_id ) -- finally soft deletes UPDATE whs.dim_customer AS dim SET row_updated_batch_id = batch, row_end_date = CURRENT_TIMESTAMP(6), row_is_deleted = CAST(1 AS bit), row_last_tran_code = tran_type FROM delta WHERE dim.skey = delta.dim_skey AND delta.tran_type = 'D' ; $$;
Get the Type 1 and 2 SCD Code
Why not get the code and test it for yourself? Sign up to download the code plus a free test rig, including a test script, portable PostgreSQL server (no install required), Beekeeper Studio lightweight SQL IDE portable (no install required) and video instructions. You’re free to adapt the code and use it in your own data warehouse.
Already signed up? Try our passwordless login instead.
Login
Type 1 and 2 ETL Pattern Explained
The following diagram and dot points break down and describe the code in detail:
- Staging Sources. This pattern assumes that we have loaded the source data into accessible staging tables, following the standard data warehouse architecture. The staging source is either a periodic full copy of source tables/data or a persistent copy of staging data that replicates the source.
- Compose Source (source CTE): The Data source of a Dimension could be a complex join across a set of source staging tables. This step composes the source from its source staging tables, selects values for all the Dimension attributes, and calculates the row_scd_type_1_hash and row_scd_type_2_hash Hashes for the source rows. The code concatenates and hashes Type 1 identified columns into the row_scd_type_1_hash, and similarly, it concatenates and hashes Type 2 identified columns into the row_scd_type_2_hash. The Hashes use the SHA-1 (SHA1 – Wikipedia) hashing algorithm. The SHA-1 algorithm remains suitable for “comparison” purposes, although it is no longer deemed secure for encryption. It generates a relatively compact hash value that is adequately free of collisions and enjoys broad support across databases. MD5 is not advised due to its high rate of collisions, while SHA-256 and higher algorithms yield hash values that are 50% larger.
- “source”. The temporary result of the “Compose Source” step.
- Derive delta change set. In this step, the process derives a data set of the records that have changed in the source when compared to the Dimension.
- First, the source is joined to the Dimension on the business key of the Dimension (in this case, customer_id)using a FULL OUTER JOIN. A Full Outer Join is necessary to identify deletes. FULL OUTER JOINs can be expensive, but Dimensions are generally small, so performance is rarely an issue. If a full outer join wasn’t used, you need a second separate query. Weighing these two facts, it’s better to do a single Full Outer Join.
- The WHERE clause selects rows where a matching row (based on the business key) exists in both the source and Dimension, and the Source and Dimension Hashes are different, indicating a Type 1 or 2 change. It also identifies new records (i.e., in the Source but not in the Dimension) and deleted records (i.e., in the Dimension but not in the Source).
- The query derives a tran_type column to indicate the detected change type. The query sets the tran_type to “1” for a Type 1 change, to “2” for a Type 2 change, to “N” for a new record, or to “D” if the record is deleted.
- The query SELECTs everything in the source table plus the surrogate key of the matching Dimension row (if there is one). The “Supercede Old Type 2 Records” step uses the surrogate key later.
- “delta”. The “Derive Changes” step outputs the temporary delta change data set. Subsequent Update statements use the “delta” data set multiple times. Creating a “delta” dataset has the advantage of a one-time initialization, repeated utilization, and operation with a significantly smaller dataset. This smaller dataset can be easily filtered by transaction type in subsequent updates, thus improving performance.
- Insert New and Type 2 Records. The “Insert New Records” inserts new Dimension Members that don’t already exist, along with new versions of any Dimension Members subject to an SCD Type 2 change. I.e. delta record with tran_type equal to 2 or N. The RETURNING clause returns a set of rows that contain the surrogate keys of the rows that were superseded by new Insertions.
- Updated Superceded Records. Updates the previous “latest” version of the Dimension Member, which the new version has superseded, to set its row_effective_end_date and row_is_latest flag. The previous “Insert New Record” step outputs a list of surrogate keys of superseded rows. This Update statement updates every row in that list, setting its row_effective_end_date and row_is_latest flag.
- Update Records with Type 1 Changes. The Type 1 Update statement updates any record when it identifies a type 1 change (tran_type = 1). It updates all records in the Dimension that match any business key of rows in “delta” where a tran_type =1. You update all versions of the matching Dimension Member (based on the business key).
- Soft Deletes. Finally, the query “soft” deletes any records in the Dimension corresponding to deleted rows in the source. A soft delete sets the row_is_deleted flag and row_end_date of the Dimension row rather than deleting the row from the Dimension. You only “soft” delete data because historical Facts in Fact tables may still reference the deleted Dimension member.
Creating a SCD Type 3 Dimension
As discussed, the SCD Type 3 change-handling strategy only captures the latest and prior value of an attribute in separate columns.
In our recommended approach we suggested only implementing Type 1 and 2 SCD ETL patterns and to emulate SCD Type 3. Developers use the Type 3 Slowly Changing Dimension (SCD) pattern very rarely, and investing in a specific pattern is not worthwhile when simple emulation is possible. To emulate Type 3 SCD, first the attribute must be a Type 2 attribute. If it’s not, then change it, then create a view similar to this example:
In our example we want to show the latest and prior values of the credit_profile attribute of the Customer Dimension. Use the SQL lag windows function. The query is as simple as this:
CREATE OR REPLACE VIEW whs.dim_customer_type_3 AS SELECT * FROM ( SELECT *, LAG(credit_profile,1,NULL) OVER(PARTITION BY customer_id ORDER BY skey) AS prior_credit_profile FROM whs.dim_customer) S WHERE row_is_latest = CAST(1 AS BIT);
The output looks like this:
Implementing a Type 4 SCD change-handing strategy
SCD Type 4 is similar to SCD Type 2, except you keep the historical versions of Dimension Members in a separate table. You divide the Dimension into a current table and a history table.
In our recommended approach, we suggested only implementing Type 1 and 2 SCD and instead emulating SCD Type 4. Developers use the Type 4 Slowly Changing Dimension (SCD) pattern very rarely, and investing in a specific pattern is not worthwhile when simple emulation is possible. In fact, we recommend never using Type 4. However, to emulate Type 4 SCD, create a view representing the “current” table, similar to this example:
CREATE OR REPLACE VIEW whs.dim_customer_current AS SELECT * FROM whs.dim_customer WHERE row_is_latest = CAST(1 AS BIT);
The existing table represents the historical table.
Creating a SCD Type 6 Dimension
The Type 6 approach is beneficial for reporting historical data accurately and simultaneously allowing the data to be presented as if it were attributed to the current version of history.
Type 6 is a combination of types 1, 2, and 3 (1+2+3 = 6). It combines the historical versioning described in Type 2 SCD with the Current vs. Prior version described in Type 3.
In our recommended approach, we suggested only implementing Type 1 and 2 SCD and just emulating SCD Type 6. Developers use the Type 6 Slowly Changing Dimension (SCD) pattern very rarely, and investing in a specific pattern is not worthwhile when simple emulation is possible. To emulate Type 6 SCD, first, the attribute must be a Type 2 attribute. If it’s not, then change it, then create a view similar to this example:
In our example, we want to show the current, prior and historical values of the credit_profile attribute of the Customer Dimension. You can achieve this with an SQL LAG Window function and a JOIN to the “latest” version of each Dimension Member. The query is as simple as this:
CREATE OR REPLACE VIEW whs.dim_customer_type_6 AS SELECT prior.*, cur.current_credit_profile, LAG(credit_profile,1,NULL) OVER(PARTITION BY customer_id ORDER BY skey) AS prior_credit_profile FROM whs.dim_customer "prior" JOIN ( SELECT customer_id AS current_customer_id, credit_profile AS current_credit_profile FROM whs.dim_customer WHERE row_is_latest = CAST(1 AS BIT) ) cur ON prior.customer_id = cur.current_customer_id
The output looks like this: