Removing Duplicates in MS SQL 2005+ Table w/o Adding an UID

Let’s say we have a table that links products to a category:

productID AS INT
categoryID AS INT

For some reason we managed to get dulicate productID’s and categoryID’s. I don’t really care to add a unique ID (UID) field to reference against using this script:

ALTER TABLE productCats
ADD RowID INT IDENTITY(1,1)
GO

DELETE productCats
FROM productCats
LEFT OUTER JOIN (
    SELECT MIN(RowID) AS RowID, productID, categoryID
    FROM productCats
    GROUP BY productID, categoryID
) AS KeepRows ON productCats.RowID = KeepRows.RowId
WHERE KeepRows.RowId IS NULL
GO

ALTER TABLE productCats
DROP COLUMN RowID
GO

Here we group together the duplicates into one row and assign it the minimum RowID for that group. If the left join returns a NULL when comparing all the RowID’s to the new set of groups then it will be deleted.

In MS SQL 2005+ I can use a Common Table Expression (CTE) to solve this issue:

WITH CTE
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY productID, categoryID ORDER BY ( SELECT 0) ) RN
         FROM productCats)
DELETE FROM CTE
WHERE RN > 1

In this example, we give a row number (RN) to each group of duplicates and delete that row if greater than 1.

For more information on CTE’s see http://msdn.microsoft.com/en-us/library/ms190766(v=sql.105).aspx

#cte, #duplicates, #sql, #t-sql