In data warehousing, it's common to have to deal with several dimensions, often originating from more than one source system, and with one, or possibly all of them, changing over time.
These slowly-changing dimensions can complicate the ETL process, especially for type 2 SCD where one is required to maintain an historical record of the changed data. In SSIS, we deal with this kind of dimensions using the Slowly Changing Dimension Component, included in the SSIS Data Flow Tasks.
However, one of the biggest problems with use of this component has always been performance. We can find rapidly the performance issues if we take a look on how this component works:
For each row in the incoming pipeline, the component performs a select statement versus the destination database, in order to retrieve the row that is going to save, and compare the changing values looking for changes. If that change happens and there's no historical attributes that have been changed, the component redirects the row to a OLEDB Command component to execute a UPDATE statement for the existing row in the destination table with the new values. If an historical attribute has been changed, then the existing row is updated by a OLEDB component to mark it as out-of-date, and a new row is inserted by an OLEDB Destination.
In this post I will describe the different approaches that we can follow to optimize the performance of our packages when we use Changing Dimensions in our Data Warehouses.
In order to see how the SSIS SCD component works, and how we can optimize the SCD processing, consider an example scenario whereby we have:
• A source database called 'SourceDB', with a 'Product' table
• A destination database called 'DataWarehouseDB', with a 'DimProduct'
Type 1 Slowly Changing Dimensions
When dealing with Type 1 Slowly Changing Dimensions (SCD), we only need to consider updating the information for changing rows, with no requirements to maintain historical fields. If we detect a change in the destination, we update this and the old values will be lost.
Using the SSIS SCD Component
We can build a data flow to handle this situation using the built-in SCD component of SSIS . The final data flow layout should look like this:

For each row in the incoming pipeline, the SCD component performs a select statement against the destination database, in order to retrieve the row that is going to save, and compare the changing values looking for changes. If a change is detected, and since there's no need to maintain historical attributes, the component redirects the row to a OLEDB Command component to execute a UPDATE statement for the existing row in the destination table with the new values.
So, in our example, we simply retrieve the product information from the SourceDB Product table, and then update the DimProduct table in the DataWarehouseDB with the latest changes.
Given the nature of our data flow, as described in figure 1, and the fact that each of these steps must be performed for every row in the incoming pipeline, we can expect to require a substantial amount of processing time for it to complete.
The LookUp/Split optimization with Type 1 SCD
Fortunately, we can perform many optimizations around this component in order to minimize the processing time of our data flow. The first step is to substitute the SCD component, which causes the biggest delay. We have two options at this point:
a) Exchange the SCD component with a Lookup component + Split component
b) Exchange the SCD component with a Source component + Merge Join + Split component
The second option will be mandatory if we want to perform a comparison between fields using anything other than the "equal to" operator (=), for example, to find out if any rows are dated later than a date field contained in the source row. A merge join component will also be necessary if you want to use a data source other than an OLEDB source, because the lookup component only accepts this kind of data sources.
Let's investigate how to perform the Lookup/Split optimization for type 1 SCDs. First, we delete the SCD component, leaving the OLEDB Command and the Insert Destination. Then we add a lookup component that will retrieve the rows from DimProduct. We should only query the changing fields from the DataWarehouse, and the fields that compose the business key.
After that, we add a Conditional Split (CSP) task. In this component, we'll add a condition called 'Update needed' with all the necessary comparisons between the fields to detect the changes.
The new data flow is shown here:

Note that, in the CSP component, we cannot only deal with a single comparison between every pair of values, because the possible null values can throw unhandled exceptions in the component. For example, if we want to compare the 'ProductCategory' field for the source and destination databases, the expression that we should use is:
!((ISNULL(SRCPRODUCTCATEGORY) && ISNULL(DWDIMPRODUCTCATEGORY)) || (!ISNULL(SRCPRODUCTCATEGORY) && !ISNULL(DWDIMPRODUCTCATEGORY) && (SRCPRODUCTCATEGORY == DWDIMPRODUCTCATEGORY)))
With this expression we avoid the null comparison, and we'll update if we have a changed value different than null. We should repeat this expression for each pair of fields that we want to compare.
Now the only thing that we should do to finish with the optimization is to connect the Error Output from the Lookup component to the Insert Destination and the 'Update Needed' Output to the OLEDB Command that performs the updates.
The end result, in our example, is that, the data flow will retrieve the SourceDB products from 'Product' table and DataWarehouseDB products from 'DimProduct' into memory. Then it will compare the changing fields from the two datasets, and it will perform a bulk insert for all new rows and an update for each existing one. The number of necessary database transactions is such smaller than the first SCD approach.
Type 2 SCD
This was the SCD 1 type optimization, but we should use the SCD 2 type every time that a dimension has an historical attribute. If an historical attribute has been changed then rather than just u0pdate the destination row with the new values, the OLEDB component must mark the existing row as out-of-date, and then insert a new row with the new values. For example, if we want to track all the pricing changes in time, we'll have to insert a new row every time that we detect that the prize has been changed in the source system.
Type 2 SCD with the SCD Component
We can design a data flow with that behavior easily with the SCD component. If we select the prize field as a Historical attribute in the SCD wizard, the resultant data flow should look like this:

Type 2 SCD with the LookUp/Split Optimization
The optimization steps are quite similar to the SCD type 1 process. We should delete the SCD component, and add the Lookup on the 'DimProduct' and the Conditional Split component. The lookup component should retrieve both the changing and historical attributes.
In order to manage both SCD type 2 changes we have to split the condition. We'll have a condition for the historical fields and another one for the changing fields. The behavior that we expect is that if a change is detected in a historical field, there's no need to continue looking for changes, because a new row will be needed. To get this behavior, the only thing that we should do is to place the historical condition in the first place, because the conditional split processes the conditions in order.
The next picture shows how the component is configured:

The last step is to connect the convenient outputs to the SCD components and the data flow will be ready to go. The layout of the data flow should look as shown here:

Performance Comparison
In order to compare the two data flows, we are going to execute the package over the Product table from AdventureWorks SQL Server database.
In the Data Loading process, these are the elapsed times for each design:
First load:
- With SCD component: 73 seconds
- With Lookup+Split: 65 seconds
We should take into account that the components validating time is the same for both data flows.
Now we're going to do some changes in the category and price fields from the Product table. Now let's do some changes:

Updating process:
- With SCD component: 74 seconds
- With Lookup+Split: 65 seconds
The results are not as amazing as we hope, but we may think that a dimension of 500 rows is far from being amazing. We'll simulate a 100.000 rows dimension with that SQL statement:
SELECT PRODUCTID, NAME, '1'+ PRODUCTNUMBER AS PRODUCTNUMBER, LISTPRICE, PRODUCTSUBCATEGORYID, MODIFIEDDATE FROM PRODUCTION.PRODUCT UNION ALL
SELECT PRODUCTID, NAME, '2'+ PRODUCTNUMBER AS PRODUCTNUMBER, LISTPRICE, PRODUCTSUBCATEGORYID, MODIFIEDDATE FROM PRODUCTION.PRODUCT UNION ALL
[...]
SELECT PRODUCTID, NAME, '200'+ PRODUCTNUMBER AS PRODUCTNUMBER, LISTPRICE, PRODUCTSUBCATEGORYID, MODIFIEDDATE FROM PRODUCTION.PRODUCT
After truncating the DimProduct table and perform the previous tests again the results are:
Load process:
- With SCD component: 11'08'' minutes
- With Lookup+Split: 2'10'' minutes
Updating process:
- With SCD component: 178'05'' minutes
- With Lookup+Split: 121’23’’ minutes
The results are much more significant, and we can now see the improvement in the performance of the data flows. We should also take into account that if we would really have a 100.000 rows dimension the server would has to move much more disk pages and the results would be even much more significant.
Conclusion
We can get a significant improvement of performance of dimension processing packages if we implement these optimizations in our data flows. The SCD component really slows down the process and it can be improved very easily and quickly.
We can even optimize one step beyond if we modify the update OLEDB command of the updates and combine an insert destination into a temporary table with a joined update in another OLEDB Command.