Binaryworld - Deleting Duplicate Records ... [ SQL Server -> T-SQL ]

May of us have faced duplicate records issue in SQL Server Table. First, I'll need some duplicates to work with. I use following script to create a table called dup_authors in the pubs database.

Click here to copy the following block

-- If the table exists, drop it
use pubs
go
If exists (select * from INFORMATION_SCHEMA.Tables where TABLE_NAME = 'dup_authors')
drop table dup_authors

-- put the records in the database
select au_lname, au_fname, city, state
into dup_authors
from authors

-- duplicate the CA records
insert dup_authors
select au_lname, au_fname, city, state
from authors
where state = 'CA'

-- duplicate the KS record, twice
insert dup_authors
select au_lname, au_fname, city, state
from authors
where state = 'KS'

insert dup_authors
select au_lname, au_fname, city, state
from authors
where state = 'KS'

-- identify the duplicated records
select au_lname, au_fname, city, state, count(*)
from dup_authors
group by au_lname, au_fname, city, state
order by count(*) desc, au_lname

It selects a subset of the columns and creates some duplicate records. At the end it runs a SELECT statement to identify the duplicate records:

Click here to copy the following block

select au_lname, au_fname, city, state, count(*)
from dup_authors
group by au_lname, au_fname, city, state
having count(*) > 1
order by count(*) desc, au_lname, au_fname

The easiest way I know of to identify duplicates is to do a GROUP BY on all the columns in the table. It can get a little cumbersome if you have a large table. My duplicates look something like this:

au_lname    au_fname city         state
--------------- ---------- -------------------- ----- -----------
Smith      Meander  Lawrence       KS  3
Bennet     Abraham  Berkeley       CA  2
Carson     Cheryl   Berkeley       CA  2
except there are thirteen additional duplicates identified.

Second, backup your database. Third, make sure you have a good backup of your database.

Temp Table and Truncate

The simplest way to eliminate the duplicate records is to SELECT DISTINCT into a temporary table, truncate the original table and SELECT the records back into the original table. That query looks like this:

Click here to copy the following block

select distinct *
into #holding
from dup_authors

truncate table dup_authors

insert dup_authors
select *
from #holding

drop table #holding

If this is a large table, it can quickly fill up your tempdb. This also isn't very fast. It makes a copy of your data and then makes another copy of your data. Also while this script is running, your data is unavailable. It may not be the best solution but it certainly works.

Rename and Copy Back

The second option is to rename the original table to something else, and copy the unique records into the original table. That looks like this:

Click here to copy the following block

sp_rename 'dup_authors', 'temp_dup_authors'

select distinct *
into dup_authors
from temp_dup_authors

drop table temp_dup_authors

This has a couple of benefits over the first option. It doesn't use tempdb and it only makes one copy of the data. On the downside, you'll need to rebuild any indexes or constraints on the table when you're done. This one also makes the data unavailable during the process.

Create a Primary Key

Our last option is more complex. It has the benefit of not making a copy of the data and only deleting the records that are duplicates. It's main drawback is that we have to alter the original table and add a sequential record number field to uniquely identify each record. That script looks like this:

Click here to copy the following block

-- Add a new column
-- In real life I'd put an index on it
Alter table dup_authors add NewPK int NULL
go

-- populate the new Primary Key
declare @intCounter int
set @intCounter = 0
update dup_authors
SET @intCounter = NewPK = @intCounter + 1

-- ID the records to delete and get one primary key value also
-- We'll delete all but this primary key
select au_lname, au_fname, city, state, RecCount=count(*), PktoKeep = max(NewPK)
into #dupes
from dup_authors
group by au_lname, au_fname, city, state
having count(*) > 1
order by count(*) desc, au_lname, au_fname

-- delete dupes except one Primary key for each dup record
delete dup_authors
from dup_authors a join #dupes d
on d.au_lname = a.au_lname
and d.au_fname = a.au_fname
and d.city = a.city
and d.state = a.state
where a.NewPK not in (select PKtoKeep from #dupes)

-- remove the NewPK column
ALTER TABLE dup_authors DROP COLUMN NewPK
go

drop table #dupes

It's actually possible to combine the SELECT INTO #dupes and the DELETE into one DELETE statement. My script is easier to read and understand and shouldn't be much slower. This will run a single delete statement against your table and only remove the duplicate records. If you have a large table the join statement can get kind of large. I guess that's the price you pay for letting duplicates into your database.