Wednesday, May 7, 2008

Remove repeating rows with SQL Server 2005

When we are performing Data Cleansing in a Data Warehouse environment, often we should take care of tables that has repeating rows (e.g. when there's no primary keys). There's no easy way to remove these repeating rows in a single query since SQL Server 2005, by the OVER statement.
I'll take a customers table to show how it works:


SELECT * FROM (
SELECT rn = row_number() OVER (PARTITION ON c.NIE ORDER BY c.DateModified), c.NIE, c.FirstName, c.Surname, ...
FROM Customers c)
WHERE rn = 1


This query will return to us only the last updated row of every customer, even if the source table had more than one row per customer and modified date.
We can also perform delete statements like this:


DELETE tbl FROM (
SELECT rn = row_number() OVER (PARTITION ON c.NIE ORDER BY c.DateModified) FROM Customers c) AS tbl
WHERE rn <> 1