BI, Analytics & Big DataTutorials

What are Slowly changing Dimensions (SCD) and why you need them in your Data Warehouse?

Below given post is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. The book is a quick guide to explore Informatica PowerCenter and its features such as working on sources, targets, transformations, performance optimization, and managing your data at speed.

Our article explores what Slowly Changing Dimensions (SCD) are and how to implement them in Informatica PowerCenter. As the name suggests, SCD allows maintaining changes in the Dimension table in the data warehouse. These are dimensions that gradually change with time, rather than changing on a regular basis. When you implement SCDs, you actually decide how you wish to maintain historical data with the current data. Dimensions present within data warehousing and in data management include static data about certain entities such as customers, geographical locations, products, and so on.

Here we talk about general SCDs: SCD1, SCD2, and SCD3. Apart from these, there are also Hybrid SCDs that you might come across. A Hybrid SCD is nothing but a combination of multiple SCDs to serve your complex business requirements.

Types of SCD

The various types of SCD are described as follows:

Type 1 dimension mapping (SCD1): This keeps only current data and does not maintain historical data.

Note : Use SCD1 mapping when you do not want history of previous data.

Type 2 dimension/version number mapping (SCD2): This keeps current as well as historical data in the table. It allows you to insert new records and changed records using a new column (PM_VERSION_NUMBER) by maintaining the version number in the table to track the changes. We use a new column PM_PRIMARYKEY to maintain the history.

Note : Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using a version number.

Consider there is a column LOCATION in the EMPLOYEE table and you wish to track the changes in the location on employees. Consider a record for Employee ID 1001 present in your EMPLOYEE dimension table. Steve was initially working in India and then shifted to USA. We are willing to maintain history on the LOCATION field.

Type 2 dimension/flag mapping: This keeps current as well as historical data in the table. It allows you to insert new records and changed records using a new column (PM_CURRENT_FLAG) by maintaining the flag in the table to track the changes. We use a new column PRIMARY_KEY to maintain the history.

Note : Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using a flag.

Let’s take an example to understand different SCDs.

Type 2 dimension/effective date range mapping: This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using two new columns (PM_BEGIN_DATE and PM_END_DATE) by maintaining the date range in the table to track the changes. We use a new column PRIMARY_KEY to maintain the history.

Note : Use SCD2 mapping when you want to keep a full history of dimension data, and track the progression of changes using start date and end date.

Type 3 Dimension mapping: This keeps current as well as historical data in the table. We maintain only partial history by adding a new column PM_PREV_COLUMN_NAME, that is, we do not maintain full history.

Note: Use SCD3 mapping when you wish to maintain only partial history.
EMPLOYEE_ID NAME LOCATION
1001 STEVE INDIA

Your data warehouse table should reflect the current status of Steve. To implement this, we have different types of SCDs.

SCD1 As you can see in the following table, INDIA will be replaced with USA, so we end up having only current data, and we lose historical data:

PM_PRIMARY_KEY EMPLOYEE_ID NAME LOCATION
100 1001 STEVE USA

Now if Steve is again shifted to JAPAN, the LOCATION data will be replaced from USA to JAPAN:

PM_PRIMARY_KEY EMPLOYEE_ID NAME LOCATION
100 1001 STEVE JAPAN

The advantage of SCD1 is that we do not consume a lot of space in maintaining the data. The disadvantage is that we don’t have historical data.

SCD2 – Version number

As you can see in the following table, we are maintaining the full history by adding a new record to maintain the history of the previous records:

PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_VERSION_NUMBER
100 1001 STEVE INDIA 0
101 1001 STEVE USA 1
102 1001 STEVE JAPAN 2
200 1002 MIKE UK 0

We add two new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID (supposed to be the primary key) column, and PM_VERSION_NUMBER to understand current and history records.

SCD2 – FLAG

As you can see in the following table, we are maintaining the full history by adding new records to maintain the history of the previous records:

 

PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_CURRENT_FLAG
100 1001 STEVE INDIA 0
101 1001 STEVE USA 1

We add two new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column, and PM_CURRENT_FLAG to understand current and history records.

Again, if Steve is shifted, the data looks like this:

PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_CURRENT_FLAG
100 1001 STEVE INDIA 0
101 1001 STEVE USA 0
102 1001 STEVE JAPAN 1

SCD2 – Date range

As you can see in the following table, we are maintaining the full history by adding new records to maintain the history of the previous records:

PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_BEGIN_DATE PM_END_DATE
100 1001 STEVE INDIA 01-01-14 31-05-14
101 1001 STEVE USA 01-06-14 99-99-9999

We add three new columns in the table: PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column, and PM_BEGIN_DATE and PM_END_DATE to understand the versions in the data.

The advantage of SCD2 is that you have complete history of the data, which is a must for data warehouse.

The disadvantage of SCD2 is that it consumes a lot of space.

SCD3

As you can see in the following table, we are maintaining the history by adding new columns:

PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_PREV_LOCATION
100 1001 STEVE USA INDIA

An optional column PM_PRIMARYKEY can be added to maintain the primary key constraints. We add a new column PM_PREV_LOCATION in the table to store the changes in the data. As you can see, we added a new column to store data as against SCD2,where we added rows to maintain history.

If Steve is now shifted to JAPAN, the data changes to this:

PM_PRIMARYKEY EMPLOYEE_ID NAME LOCATION PM_PREV_LOCATION
100 1001 STEVE JAPAN USA

As you can notice, we lost INDIA from the data warehouse, that is why we say we are maintaining partial history.

Note : To implement SCD3, decide how many versions of a particular column you wish to maintain. Based on this, the columns will be added in the table.

SCD3 is best when you are not interested in maintaining the complete but only partial history.

The drawback of SCD3 is that it doesn’t store the full history.

At this point, you should be very clear about the different types of SCDs. We need to implement these concepts practically in Informatica PowerCenter. Informatica PowerCenter provides a utility called wizard to implement SCD. Using the wizard, you can easily implement any SCD. In the next topics, you will learn how to use the wizard to implement SCD1, SCD2, and SCD3.

Before you proceed to the next section, please make sure you have a proper understanding of the transformations in Informatica PowerCenter. You should be clear about the source qualifier, expression, filter, router, lookup, update strategy, and sequence generator transformations. Wizard creates a mapping using all these transformations to implement the SCD functionality.

When we implement SCD, there will be some new records that need to be loaded into the target table, and there will be some existing records for which we need to maintain the history.

Note : The record that comes for the first time in the table will be referred to as the NEW record, and the record for which we need to maintain history will be referred to as the CHANGED record. Based on the comparison of the source data with the target data, we will decide which one is the NEW record and which is the CHANGED record.

To start with, we will use a sample file as our source and the Oracle table as the target to implement SCDs. Before we implement SCDs, let’s talk about the logic that will serve our purpose, and then we will fine-tune the logic for each type of SCD.

  1. Extract all records from the source.
  2. Look up on the target table, and cache all the data.
  3. Compare the source data with the target data to flag the NEW and CHANGED records.
  4. Filter the data based on the NEW and CHANGED flags.
  5. Generate the primary key for every new row inserted into the table.
  6. Load the NEW record into the table, and update the existing record if needed.

In this article we concentrated on a very important table feature called slowly changing dimensions. We also discussed different types of SCDs, i.e., SCD1, SCD2, and SCD3. If you are looking to explore more in Informatica Powercentre, go ahead and check out the book Learning Informatica Powercentre 10.x.

 

Tags

Savia Lobo

A Data science fanatic. Loves to be updated with the tech happenings around the globe. Loves singing and composing songs. Believes in putting the art in smart.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *