Seek for duplicate rows

Question

Post reply

Seek for duplicate rows

johnsql-193053

SSCommitted

Points: 1657
More actions
February 10, 2007 at 3:35 pm

#176769

Hi,

I read some book that says the following query will seek for duplicate rows in a table. However, the book does not gives me a clear explanation

1/ why the query can do that?

2/ why do we have to use JOIN not INNER JOIN?

Any help would be great appreciated.

--- a row that is in table "compliancehousehold" and is duplicated if

-- its values is specified in the join.

select C1.*

from compliancehousehold C1

join compliancehousehold C2

on C1.reshID = C2.reshID

and C1.reshstartDate = C2.reshstartDate

and C1.ReshHudNextAnnualDate is null

and C1.chid < C2.chid

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Sergiy SSC Guru Points: 110192 More actions · Answer 1

Sergiy

SSC Guru

Points: 110192

February 11, 2007 at 1:34 pm

#689060

1. Open BOL anf find out that JOIN == INNER JOIN.

2. Give yourself a definition what is "duplicate rows".

_____________
Code for TallyGenerator

johnsql-193053 SSCommitted Points: 1657 More actions · Answer 2

Sergiy,

I do not know and ask people on this forum. However, your words make me frustrating because BOL if helped me, I did not need to post my question here. Please stop your writing for this topic if you feel it wastes your time and let others reply. I do not need your words since they do not help me though. I'd like to discuss with others to help me learn new things instead of seeing your words that turn to be nothing. I do not need you. If you do not like to answer, you can ignore, right?

johnsql

Sergiy SSC Guru Points: 110192 More actions · Answer 3

If you don't want to think leave this profession right now. It's not for you.

I stated everything you should get from BOL if you've been taught to read manuals.

JOIN == INNER JOIN.

Probably from the second attempt you can get that this is the answer on one of your questions.

Answer on another question does not require SQL skills.

It's just common sence.

If you cannot describe in plain English what kind on rows you consider duplicated, sorry, I'm afraid nothing and nobody can help you then.

_____________
Code for TallyGenerator

Michael Meierruth SSChampion Points: 10051 More actions · Answer 4

johnsql,

I thought Sergiy was pretty helpful. He saved you a trip to BOL by saying that JOIN is the same as INNER JOIN. In fact, BOL shows [INNER] JOIN which means the word INNER is optional.

Outer joins require LEFT [OUTER] JOIN or RIGHT [OUTER] JOIN.

My personal favorite is:

JOIN

LEFT JOIN

RIGHT JOIN

As for the duplicate problem, your query joins a table to itself which is not necessarily the best way to find duplicates. The next best thing to a "definition of duplicate" that Sergiy asked for is to show some sample data and explain which ones are duplicates. Then we can help you with your SQL query.

johnsql-193053 SSCommitted Points: 1657 More actions · Answer 5

Michael,

Thanks your your words.

The condidions I guess are (as the book writes)

C1.reshID = C2.reshID

and C1.reshstartDate = C2.reshstartDate

and C1.ReshHudNextAnnualDate is null

and C1.chid < C2.chid

To find out duplicate rows. But I am not sure because the book as I told does not give any extra reason why the query can perform finding out duplicate rows. Those are my questions I want to ask. BTW, how many posts to this forum so I can become a "profession" in this forum?

johnsql

Michael Meierruth SSChampion Points: 10051 More actions · Answer 6

What kind of book is this? If it presents a query like this without a detailed description of the table and some sample data it is impossible understand what it is trying to accomplish. It cannot possibly teach anything - much less find duplicate rows.

Try to imagine the data! You must make that effort!

I don't understand what you mean by: "profession" in this forum. Most people, like me, participate in this forum to learn and to help people learn. And there is nothing like learning from the real problems other people encounter. So what is your problem all about?

Rick Ziminski Old Hand Points: 313 More actions · Answer 7

I have this in my notes, (written by someone other than me) but I have used it and varations of it many times:

/*

The first step to a duplicate-free table is coming up with a way of uniquely identifying each row. There's no way you can delete a particular row without being able to distinguish it from the others. The easiest way to accomplish this is to add an identity column to this table and make it the primary key.

ALTER TABLE Payment ADD

PaymentID int NOT NULL IDENTITY(1, 1),

CONSTRAINT PK_Payment PRIMARY KEY(PaymentID)

Now that you have the ability to specify a particular row to delete, you must decide how to define "duplicate". Which columns should uniquely identify a row? In real-world scenarios, the answer will be based upon how the table has been used and the actual data contained in it; however, for this example, a combination of CustomerNumber, PostedDatetime, and PaymentAmt will serve as the logical key. It makes sense that the same customer would not post two payments of exactly the same amount at exactly the same time.

Having decided what a duplicate is, you need some way of identifying them. To identify the duplicate rows, you need to construct a query which returns each unique CustomerNumber / PostedDatetime / PaymentAmt combination where there is more than one row with the same key values.

SELECT

CustomerNumber,

PostedDatetime,

PaymentAmt

FROM

Payment

GROUP BY

CustomerNumber,

PostedDatetime,

PaymentAmt

HAVING

COUNT(*) > 1

This query groups by the key values to return one row per unique set of values. The HAVING clause then filters out all groups where there is only one row for that set of values. This leaves a rowset containing only those logical key values where there are duplicates.

The next problem to solve is how to eliminate all but one row per set of key values. The PaymentID column added earlier provides an answer. If we could somehow get exactly one PaymentID for each set of duplicates, we could delete all rows where the CustomerNumber / PostedDatetime / PaymentAmt is returned in the above query, but the PaymentID does not match. This would delete all but one row per set of duplicates. How can we accomplish this?

When grouping data, aggregate functions can be used to return data about the group as a whole. While some of the aggregate functions such as SUM() and AVG() return values which don't exist on any particular row in the group, the extrema functions MIN() and MAX() return values which exist in an actual row in the group. Hence, the extrema functions can be used to return an arbitrary value from some set of values.

Using this knowledge about MIN() and MAX(), we can modify the above query to return the CustomerNumber / PostedDatetime / PaymentAmt for each set of duplicates along with exactly one unique PaymentID.

SELECT

MAX(PaymentID) AS PaymentID,

CustomerNumber,

PostedDatetime,

PaymentAmt

FROM

Payment

GROUP BY

CustomerNumber,

PostedDatetime,

PaymentAmt

HAVING

COUNT(*) > 1

This query returns all the data we need to perform the delete operation, but how do we tell the database engine to actually delete the records? We can't simply change the SELECT to DELETE because DELETE statements don't support grouping and aggregates. However, the DELETE statement does support joining to other tables. We can use a technique called a derived table to treat the above SELECT statement as if it were a table. We can then write a DELETE statement which joins to this derived table and performs the DELETE.

DELETE FROM

p1

FROM

Payment p1

INNER JOIN

(

SELECT

MAX(PaymentID) AS PaymentID,

CustomerNumber,

PostedDatetime,

PaymentAmt

FROM

Payment

GROUP BY

CustomerNumber,

PostedDatetime,

PaymentAmt

HAVING

COUNT(*) > 1

&nbsp p2

ON(

p1.CustomerNumber = p2.CustomerNumber

AND

p1.PostedDatetime = p2.PostedDatetime

AND

p1.PaymentAmt = p2.PaymentAmt

AND

p1.PaymentID <> p2.PaymentID

&nbsp

The most interesting part of this query is the join criteria. We join on CustomerNumber, PostedDatetime, and PaymentAmt to delete rows that match logically to the set of duplicates that we've found. If we left the criteria at this point, we'd end up deleting the duplicates, but we wouldn't have any copies of the duplicate rows remaining. Our requirements stated that we delete all the duplicates, but leave one copy of each duplicated row. To do this, we add the fourth criterion stating the PaymentID not match one of the PaymentID values in the set of duplicates. Thus, one row per set of duplicates is retained while deleting the others.

Hopefully, you'll never need to use these techniques to delete duplicate data. However, if you do, you won't have to struggle to find a technique that precisely and efficiently performs the task. Even if you don't use this technique to delete duplicates, many of the elements of the solution - derived tables, joined deletes, grouping and aggregation - can be applied to other situations that require inventive query techniques.

*/

Michael Meierruth SSChampion Points: 10051 More actions · Answer 8

This is a great piece of writing and hits the nail on the head.

But let's get back to johnsql's real life problem. In particulare, let's take his query and see if it can find duplicate data by just inventing some data. We can simplify it (without losing the essence of the problem) to just the first and last columns:

select C1.*

from compliancehousehold C1

join compliancehousehold C2 on C1.reshID = C2.reshID and C1.chid < C2.chid

Thus we have two columns and we want to find 'duplicates'.

With the data

1,1

the query returns nothing. It cannot seem to find the duplicate!

With the data

1,1

1,2

the query returns

1,1

so it seems to have found a duplicate!

With the data

1,1

1,2

1,3

the query returns

1,1

1,2

so it finds the duplicates but it seems to find the first duplicate twice. That's because the 1st row gets linked to the second row AND the third row.

Hope this helps.