|
|
|
May of us have faced duplicate records issue in SQL Server Table. First, I'll need some duplicates to work with. I use following script to create a table called dup_authors in the pubs database. |
Click here to copy the following block | use pubs go If exists (select * from INFORMATION_SCHEMA.Tables where TABLE_NAME = 'dup_authors') drop table dup_authors
select au_lname, au_fname, city, state into dup_authors from authors
insert dup_authors select au_lname, au_fname, city, state from authors where state = 'CA'
insert dup_authors select au_lname, au_fname, city, state from authors where state = 'KS'
insert dup_authors select au_lname, au_fname, city, state from authors where state = 'KS'
select au_lname, au_fname, city, state, count(*) from dup_authors group by au_lname, au_fname, city, state order by count(*) desc, au_lname |
It selects a subset of the columns and creates some duplicate records. At the end it runs a SELECT statement to identify the duplicate records: |
Click here to copy the following block | select au_lname, au_fname, city, state, count(*) from dup_authors group by au_lname, au_fname, city, state having count(*) > 1 order by count(*) desc, au_lname, au_fname |
The easiest way I know of to identify duplicates is to do a GROUP BY on all the columns in the table. It can get a little cumbersome if you have a large table. My duplicates look something like this:
au_lname au_fname city state --------------- ---------- -------------------- ----- ----------- Smith Meander Lawrence KS 3 Bennet Abraham Berkeley CA 2 Carson Cheryl Berkeley CA 2 except there are thirteen additional duplicates identified.
Second, backup your database. Third, make sure you have a good backup of your database.
Temp Table and Truncate
The simplest way to eliminate the duplicate records is to SELECT DISTINCT into a temporary table, truncate the original table and SELECT the records back into the original table. That query looks like this: |
If this is a large table, it can quickly fill up your tempdb. This also isn't very fast. It makes a copy of your data and then makes another copy of your data. Also while this script is running, your data is unavailable. It may not be the best solution but it certainly works.
Rename and Copy Back
The second option is to rename the original table to something else, and copy the unique records into the original table. That looks like this: |
This has a couple of benefits over the first option. It doesn't use tempdb and it only makes one copy of the data. On the downside, you'll need to rebuild any indexes or constraints on the table when you're done. This one also makes the data unavailable during the process.
Create a Primary Key
Our last option is more complex. It has the benefit of not making a copy of the data and only deleting the records that are duplicates. It's main drawback is that we have to alter the original table and add a sequential record number field to uniquely identify each record. That script looks like this: |
Click here to copy the following block |
Alter table dup_authors add NewPK int NULL go
declare @intCounter int set @intCounter = 0 update dup_authors SET @intCounter = NewPK = @intCounter + 1
select au_lname, au_fname, city, state, RecCount=count(*), PktoKeep = max(NewPK) into #dupes from dup_authors group by au_lname, au_fname, city, state having count(*) > 1 order by count(*) desc, au_lname, au_fname
delete dup_authors from dup_authors a join #dupes d on d.au_lname = a.au_lname and d.au_fname = a.au_fname and d.city = a.city and d.state = a.state where a.NewPK not in (select PKtoKeep from #dupes)
ALTER TABLE dup_authors DROP COLUMN NewPK go
drop table #dupes |
It's actually possible to combine the SELECT INTO #dupes and the DELETE into one DELETE statement. My script is easier to read and understand and shouldn't be much slower. This will run a single delete statement against your table and only remove the duplicate records. If you have a large table the join statement can get kind of large. I guess that's the price you pay for letting duplicates into your database.
|
|
|
|
Submitted By :
Nayan Patel
(Member Since : 5/26/2004 12:23:06 PM)
|
|
|
Job Description :
He is the moderator of this site and currently working as an independent consultant. He works with VB.net/ASP.net, SQL Server and other MS technologies. He is MCSD.net, MCDBA and MCSE. In his free time he likes to watch funny movies and doing oil painting. |
View all (893) submissions by this author
(Birth Date : 7/14/1981 ) |
|
|