SSIS and SQL Server Journey

Personal Notes of Sarabjit Singh and Bhavpreet Singh

SSIS: Remove duplicate rows from input file

on March 23, 2013

We have a requirement in which there are 40,000,000 (forty million) records in the input file. We have to load it into the database after applying a few business-logic transformations. The main concern is removing duplicate rows. Following are the possible solutions.

1. Use the Sort transformation and check the “Remove rows with duplicate sort values” option. This would take quite long, as Sort is a fully blocking transformation.

2. Use a Script component and compare rows in code. I guess this would again behave like a fully blocking transformation.

3. Dump the data into the database and then reload it after selecting distinct records. This looks like the best option.
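Option 3 can be sketched in T-SQL; this is just an illustration, and the table and column names here are hypothetical, not from any real package:

```sql
-- Staging table that the SSIS data flow bulk-loads the raw file into
-- (table and column names are illustrative)
CREATE TABLE dbo.StagingInput (
    CustomerId INT,
    OrderDate  DATE,
    Amount     DECIMAL(18, 2)
);

-- After the load, reinsert only the distinct rows into the target table
INSERT INTO dbo.TargetTable (CustomerId, OrderDate, Amount)
SELECT DISTINCT CustomerId, OrderDate, Amount
FROM dbo.StagingInput;
```

Because the DISTINCT runs inside the database engine, it can use indexes, tempdb, and set-based operators instead of the SSIS pipeline's buffer memory.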

I'll be back with the stats. Feel free to add your suggestions/stats/comments.

Happy SSISing.. 🙂


One response to “SSIS: Remove duplicate rows from input file”

  1. And here are the stats:
    The package with the Sort transformation took more than an hour, whereas the one that dumps data to the DB, selects distinct rows, and reinserts them took around 3 minutes.
    The Script component would surely take more time than the DB approach, as it would again be using buffer memory.
    Hope it helps 🙂
