Monday, April 27, 2009

Custom Data Cleansing with SSIS

Hello guys.
A lot of times I’ve needed to deal with really dirty information, like blank strings, null values and future dates. Most of the times we need to clean these values along several data flows and several packages. We can do it easilly including a Column Transformation component, but we’ll have to declare the transform function for each column.
A different approach uses the Script component to cover every column that you want to clean automatically, via Reflection.
In the Script component we’ll have to select the input columns that we need to clean. Then we declare a loop which will cover each column and will perform a function call to clean the value, depending on the type of the column value.
In the next code sample, I clean the string columns having blank values, and replace the future dates for null dates:


Imports System
Imports System.Data
Imports System.Math
Imports System.Reflection
Imports Microsoft.SqlServer.Dts.Pipeline
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
Inherits UserComponent

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
Dim str As String
Dim obj As Object
Try
Dim p As PropertyInfo
For Each p In Row.GetType().GetProperties()
'First check, null properties
If p Is Nothing Then
Continue For
End If
'Second check, null values
If p.GetValue(Row, Nothing) Is Nothing Then
Continue For
End If
'Second check by types
If p.PropertyType Is GetType(String) Then
p.SetValue(Row, CleanString(p.GetValue(Row, Nothing).ToString()), Nothing)
End If
If p.PropertyType Is GetType(DateTime) Then
If CType(Row.GetType().GetProperty(p.Name() + "_IsNull").GetValue(Row, Nothing), Boolean) = True Then
Continue For
End If
obj = CleanDate(p.GetValue(Row, Nothing))
If obj Is Nothing Then
Row.GetType().GetProperty(p.Name() + "_IsNull").SetValue(Row, True, Nothing)
Else
p.SetValue(Row, obj, Nothing)
End If
End If
Next
Catch e As Exception
'Do Nothing
End Try
End Sub

Public Function CleanString(ByVal str As String) As String
If str.Trim() = "" Then
Return Nothing
End If
Return str
End Function

Public Function CleanDate(ByVal obj As Object) As Object
Dim dt As DateTime
dt = CType(obj, DateTime)
If dt > DateTime.Now Then
Return Nothing
End If
Return obj
End Function

End Class

Sunday, March 22, 2009

Timestamp Data Extraction with SSIS

Hello Everyone.
Maybe you heard about the different data extraction methods that exist in the ETL world: Timestamp, Modification Date, Bit Column, Checksum, Triggers, Lookup or even with SQL Server 2008 we have the Change Data Capture or the MERGE statement.
Today I’m going to explain the Timestamp approach and how to implement it with SSIS. First of all we have to understand what the timestamp data type is and how to use it.
The timestamp data type consists of a set of bytes that helps us to identify which rows have been modified. When you add a timestamp column to a table (only one per table is allowed), every inserted or updated row is marked with a new value of the timestamp. However, you must take into account it’s not possible to convert a timestamp value to anything similar to a date. It’s only a sequence of bytes, a number that is continuously growing.
We can store it into a control table for each execution of our ETL process. Therefore in every execution we only process the rows that have been modified since the last time.
Let’s see how to do this approach by an example.
We’ll have a source customer table, in which we have added a timestamp column. We’ll also have a destination customer table and a control table to store the last timestamp read from the source table. The tables should look like this:

Now we’ll design the SSIS package. In order not to miss any row, we must retrieve both timestamp values at the beginning, and store them into two SSIS variables. The two select statements are described here:
• The first one retrieves the timestamp read at the last execution, stored in the control table. We store the result value into the TimestampTo SSIS variable.


SELECT LastReadTimestamp FROM LastReadTimestamps WHERE TableName = 'customers'


• The second one retrieves the maximum value of all timestamp stored into the source customers table. We store the result value into the TimestampFrom SSIS variable.


SELECT convert(varchar(20),convert(numeric(20,0),max(timestamp) + 0)) AS timestamp FROM dbo.Customers


In the next control flow step we’ll design the data flow. With a data flow task we can move the modified rows from the source to the destination table. To do this, we can use the two variables as parameters in the OLEDB Source component. The select statement should look like this:


SELECT CustomerID,Name,Surname,Location
FROM Customers
WHERE convert(numeric(20,0),timestamp + 0) > ? AND convert(numeric(20,0),timestamp + 0) <= ?


We need to convert the timestamp to a numeric value to compare it with the parameter. SQL Server is not capable to convert directly from timestamp to a numeric, but placing a “+0” after the column we can solve the problem quickly.
Once we have the modified values we can easily synchronize the destination table with the SCD component. The design of the data flow should look like this:

Finally we store the timestamp of the last read customer in the execution, which is exactly the value stored into the TimestampTo variable. We store the value by a SQL task with a parameter:


UPDATE LastReadTimestamps
SET LastReadTimestamp = convert(numeric(20,0), ?)
WHERE TableName = 'Customers'



With this approach we can easily improve our SSIS data flow performance because we only process the rows that have been modified since the last execution. No extra work is required.
I hope you will find this article interesting, you can find the sample package right here.

Thursday, December 4, 2008

Create a Custom Time Dimension

Hello all.

Recently I've created a custom Time Dimension for a new Data Warehouse in a Financial company, so I used a SQL Server script to create the SQL table and to populate it. Here you have the script that I used:


CREATE TABLE TIME (
TIME_KEY smalldatetime,
[DAY] int,
[MONTH] int,
[QUARTER] int,
[SEMESTER] int,
[HALF] int,
[YEAR] int,
[WEEKDAY] int,
[WEEK_OF_YEAR] int,
[MONTH_NAME] varchar(20),
[QUARTER_NAME] varchar(2),
[SEMESTER_NAME] varchar(2),
[HALF_NAME] varchar(2),
[WEEKDAY_NAME] varchar(10),
[DATE] [smalldatetime],
PRIMARY KEY CLUSTERED (TIME_KEY)
)

SET DATEFORMAT dmy

DECLARE @startDate SMALLDATETIME
DECLARE @currentDate AS SMALLDATETIME
DECLARE @maxDate AS DATETIME
DECLARE @halfNbr AS INT
DECLARE @semesterNbr AS INT
DECLARE @monthName AS VARCHAR(50)
DECLARE @quarterName AS VARCHAR(2)
DECLARE @semesterName AS VARCHAR(2)
DECLARE @halfName AS VARCHAR(2)
DECLARE @weekDayName AS VARCHAR(10)

SET @startDate = '20040101'
SET @maxDate = DATEADD(YEAR, 25, GETDATE())
SET @currentDate = @startDate

WHILE (@currentDate < @maxDate)
BEGIN
SET @semesterNbr = CASE WHEN MONTH(@currentDate) < 5 THEN 1 WHEN MONTH(@currentDate) > 8 THEN 3 ELSE 2 END
SET @halfNbr = CASE WHEN MONTH(@currentDate) < 7 THEN 1 ELSE 2 END
SET @monthName = DATENAME(MONTH, @currentDate)
SET @quarterName = 'Q' + CONVERT(char(1), DATEPART(QUARTER, @currentDate))
SET @semesterName = 'S' + CONVERT(char(1), @semesterNbr)
SET @halfName = 'H' + CONVERT(char(1), @halfNbr)
SET @weekDayName = DATENAME(WEEKDAY, @currentDate)

INSERT INTO [TIME]
VALUES(@currentDate, DAY(@currentDate), MONTH(@currentDate), DATEPART(QUARTER,@currentDate),
@semesterNbr, @halfNbr, YEAR(@currentDate), DATEPART(WEEKDAY, @currentDate),
DATEPART(WEEK,@currentDate), @monthName, @quarterName, @semesterName, @halfName,
DATENAME(WEEKDAY, @currentDate), @currentDate)
SET @currentDate = DATEADD(DAY, 1, @currentDate)
END


Tuesday, November 11, 2008

SCD (Slowly Changing Dimension) optimization in SSIS

In data warehousing, it's common to have to deal with several dimensions, often originating from more than one source system, and with one, or possibly all of them, changing over time.
These slowly-changing dimensions can complicate the ETL process, especially for type 2 SCD where one is required to maintain an historical record of the changed data. In SSIS, we deal with this kind of dimensions using the Slowly Changing Dimension Component, included in the SSIS Data Flow Tasks.
However, one of the biggest problems with use of this component has always been performance. We can find rapidly the performance issues if we take a look on how this component works:
For each row in the incoming pipeline, the component performs a select statement versus the destination database, in order to retrieve the row that is going to save, and compare the changing values looking for changes. If that change happens and there's no historical attributes that have been changed, the component redirects the row to a OLEDB Command component to execute a UPDATE statement for the existing row in the destination table with the new values. If an historical attribute has been changed, then the existing row is updated by a OLEDB component to mark it as out-of-date, and a new row is inserted by an OLEDB Destination.
In this post I will describe the different approaches that we can follow to optimize the performance of our packages when we use Changing Dimensions in our Data Warehouses.
In order to see how the SSIS SCD component works, and how we can optimize the SCD processing, consider an example scenario whereby we have:
• A source database called 'SourceDB', with a 'Product' table
• A destination database called 'DataWarehouseDB', with a 'DimProduct'

Type 1 Slowly Changing Dimensions


When dealing with Type 1 Slowly Changing Dimensions (SCD), we only need to consider updating the information for changing rows, with no requirements to maintain historical fields. If we detect a change in the destination, we update this and the old values will be lost.

Using the SSIS SCD Component


We can build a data flow to handle this situation using the built-in SCD component of SSIS . The final data flow layout should look like this:

For each row in the incoming pipeline, the SCD component performs a select statement against the destination database, in order to retrieve the row that is going to save, and compare the changing values looking for changes. If a change is detected, and since there's no need to maintain historical attributes, the component redirects the row to a OLEDB Command component to execute a UPDATE statement for the existing row in the destination table with the new values.
So, in our example, we simply retrieve the product information from the SourceDB Product table, and then update the DimProduct table in the DataWarehouseDB with the latest changes.
Given the nature of our data flow, as described in figure 1, and the fact that each of these steps must be performed for every row in the incoming pipeline, we can expect to require a substantial amount of processing time for it to complete.

The LookUp/Split optimization with Type 1 SCD


Fortunately, we can perform many optimizations around this component in order to minimize the processing time of our data flow. The first step is to substitute the SCD component, which causes the biggest delay. We have two options at this point:
a) Exchange the SCD component with a Lookup component + Split component
b) Exchange the SCD component with a Source component + Merge Join + Split component
The second option will be mandatory if we want to perform a comparison between fields using anything other than the "equal to" operator (=), for example, to find out if any rows are dated later than a date field contained in the source row. A merge join component will also be necessary if you want to use a data source other than an OLEDB source, because the lookup component only accepts this kind of data sources.
Let's investigate how to perform the Lookup/Split optimization for type 1 SCDs. First, we delete the SCD component, leaving the OLEDB Command and the Insert Destination. Then we add a lookup component that will retrieve the rows from DimProduct. We should only query the changing fields from the DataWarehouse, and the fields that compose the business key.
After that, we add a Conditional Split (CSP) task. In this component, we'll add a condition called 'Update needed' with all the necessary comparisons between the fields to detect the changes.
The new data flow is shown here:

Note that, in the CSP component, we cannot only deal with a single comparison between every pair of values, because the possible null values can throw unhandled exceptions in the component. For example, if we want to compare the 'ProductCategory' field for the source and destination databases, the expression that we should use is:


!((ISNULL(SRCPRODUCTCATEGORY) && ISNULL(DWDIMPRODUCTCATEGORY)) || (!ISNULL(SRCPRODUCTCATEGORY) && !ISNULL(DWDIMPRODUCTCATEGORY) && (SRCPRODUCTCATEGORY == DWDIMPRODUCTCATEGORY)))


With this expression we avoid the null comparison, and we'll update if we have a changed value different than null. We should repeat this expression for each pair of fields that we want to compare.
Now the only thing that we should do to finish with the optimization is to connect the Error Output from the Lookup component to the Insert Destination and the 'Update Needed' Output to the OLEDB Command that performs the updates.
The end result, in our example, is that, the data flow will retrieve the SourceDB products from 'Product' table and DataWarehouseDB products from 'DimProduct' into memory. Then it will compare the changing fields from the two datasets, and it will perform a bulk insert for all new rows and an update for each existing one. The number of necessary database transactions is such smaller than the first SCD approach.

Type 2 SCD


This was the SCD 1 type optimization, but we should use the SCD 2 type every time that a dimension has an historical attribute. If an historical attribute has been changed then rather than just u0pdate the destination row with the new values, the OLEDB component must mark the existing row as out-of-date, and then insert a new row with the new values. For example, if we want to track all the pricing changes in time, we'll have to insert a new row every time that we detect that the prize has been changed in the source system.

Type 2 SCD with the SCD Component


We can design a data flow with that behavior easily with the SCD component. If we select the prize field as a Historical attribute in the SCD wizard, the resultant data flow should look like this:


Type 2 SCD with the LookUp/Split Optimization



The optimization steps are quite similar to the SCD type 1 process. We should delete the SCD component, and add the Lookup on the 'DimProduct' and the Conditional Split component. The lookup component should retrieve both the changing and historical attributes.
In order to manage both SCD type 2 changes we have to split the condition. We'll have a condition for the historical fields and another one for the changing fields. The behavior that we expect is that if a change is detected in a historical field, there's no need to continue looking for changes, because a new row will be needed. To get this behavior, the only thing that we should do is to place the historical condition in the first place, because the conditional split processes the conditions in order.
The next picture shows how the component is configured:

The last step is to connect the convenient outputs to the SCD components and the data flow will be ready to go. The layout of the data flow should look as shown here:

Performance Comparison


In order to compare the two data flows, we are going to execute the package over the Product table from AdventureWorks SQL Server database.
In the Data Loading process, these are the elapsed times for each design:
First load:
- With SCD component: 73 seconds
- With Lookup+Split: 65 seconds
We should take into account that the components validating time is the same for both data flows.
Now we're going to do some changes in the category and price fields from the Product table. Now let's do some changes:

Updating process:
- With SCD component: 74 seconds
- With Lookup+Split: 65 seconds
The results are not as amazing as we hope, but we may think that a dimension of 500 rows is far from being amazing. We'll simulate a 100.000 rows dimension with that SQL statement:


SELECT PRODUCTID, NAME, '1'+ PRODUCTNUMBER AS PRODUCTNUMBER, LISTPRICE, PRODUCTSUBCATEGORYID, MODIFIEDDATE FROM PRODUCTION.PRODUCT UNION ALL
SELECT PRODUCTID, NAME, '2'+ PRODUCTNUMBER AS PRODUCTNUMBER, LISTPRICE, PRODUCTSUBCATEGORYID, MODIFIEDDATE FROM PRODUCTION.PRODUCT UNION ALL
[...]
SELECT PRODUCTID, NAME, '200'+ PRODUCTNUMBER AS PRODUCTNUMBER, LISTPRICE, PRODUCTSUBCATEGORYID, MODIFIEDDATE FROM PRODUCTION.PRODUCT


After truncating the DimProduct table and perform the previous tests again the results are:
Load process:
- With SCD component: 11'08'' minutes
- With Lookup+Split: 2'10'' minutes
Updating process:
- With SCD component: 178'05'' minutes
- With Lookup+Split: 121’23’’ minutes
The results are much more significant, and we can now see the improvement in the performance of the data flows. We should also take into account that if we would really have a 100.000 rows dimension the server would has to move much more disk pages and the results would be even much more significant.

Conclusion


We can get a significant improvement of performance of dimension processing packages if we implement these optimizations in our data flows. The SCD component really slows down the process and it can be improved very easily and quickly.
We can even optimize one step beyond if we modify the update OLEDB command of the updates and combine an insert destination into a temporary table with a joined update in another OLEDB Command.

Monday, October 27, 2008

Business Intelligence Seminars

Hello everybody.
Last month I've really occupied preparing two Business Intelligence seminars to expand the BI competence inside the company I work for. These seminars took place in Barcelona (Spain) and in Andorra la Vella (Andorra).

In Barcelona I was talking three hours about the SQL Server 2008 improvements and a large Case Study about a financial company called GMAC RFC. Nonetheless, in Andorra I have only 45 minutes, so I was talking about the BI current situation in the market and a few representative case studies.
You can see more about the seminars at:


Monday, September 15, 2008

Custom Logging and Output List Messages in SSIS Scripts

One of the most common problems developing SSIS packages is that you cannot debug every step that you perform. Only control flow script tasks can be debugged with the VS integrated debugger, but in most cases that is not enough.
A different approach to debugging for testing purposes can be logging. We can enable logging in the package to log messages from the components, and connect with a log provider to save into a SQL Server DB, text file, etc. But what if we want to log our custom messages in the log?
The answer is that we can use script tasks and script components for custom logging.
Optionally, we can perform a different way to view this information by the VS events. Thus, the messages will be accessible in the "Output List" of Visual Studio.
To log these messages we should add these VB statements into our scripts:

In script components (Data Flow Task):


Me.Log("This is a logging message in a DFT", 0, Nothing)
Me.ComponentMetaData.FireInformation(0, "ComponentName", "This is an information message in a DFT", Nothing, Nothing, False)


In script tasks (Control Flow):


Dts.Log("This is a log message in a Control Flow", 0, Nothing)
Dts.Events.FireInformation(0, "ComponentName", "This is an information message in a Control Flow", Nothing, Nothing, False)


A final step is to enable logging feature into the package, by the SSIS-->Logging menu. We should add a logging provider, and check both the ScriptComponentLogEntry for the DFT component and the ScriptTaskLogEntry for the script task.

We can see the results in the next screenshot.

Thursday, August 28, 2008

Let the Null Values pass through the Lookup transformation in SSIS

After the summer, here I am to communicate something that surprises me and it can be useful for someone (I hope).
When we use the lookup component to retrieve values from another OLEDB source, we always hope that all entries will match with a row in the lookup table. When this situation doesn't happen, we have the ability to let the row pass the lookup component with a null value in the lookup fields.
But what if we want to have a different behavior for the null values that come from the source pipeline? We can have the next situation:
We are translating some category codes from the products source table to a common set of new category codes. We have a translation table that will match every category source code with a new one. But we know that some rows from the source system come with a null value in the category field. For these rows, we want a 'Not Available' as translated value.
The only thing that we should do to get this behavior in the SSIS Lookup component is to perform this little trick into the lookup component statement:


select code, translation from CodeTable
union all
select null, 'Not Available'


As lookup component is able to match null values, the rows with a null value in the category field will be translated into the 'Not Available' value, just as we wanted.