Duplicates

Question

Post reply

Duplicates

Mike Levan

SSChampion

Points: 14201
More actions
May 25, 2006 at 8:29 am

#115228

I am using the following script but am getting duplicate records for SSN. How can I get rid of duplicates for the field NBR_SSN

BEGIN

DECLARE @nbr_energy_year BIGINT

SET @nbr_energy_year = 2004

SELECT DISTINCT RIGHT('000000000' + CAST(app.NBR_SSN AS VARCHAR(9)),9) AS NBR_SSN,

UPPER (app.NAM_FIRST)AS NAM_FIRST,

UPPER (app.NAM_LAST) AS NAM_LAST,

app.DTE_DOB

FROM T_APPLN app WITH(NOLOCK)

INNER JOIN T_STATUS_APPLN tsa WITH (NOLOCK)

ON app.CDE_STATUS_APPLN = tsa.CDE_STATUS_APPLN

WHERE app.NBR_ENERGY_YEAR = @nbr_energy_year

AND app.CDE_STATUS_APPLN = 2

AND app.CDE_ACTIVE_FLAG = 1

GROUP BY app.NBR_SSN,

app.NAM_FIRST,

app.NAM_LAST,

app.DTE_DOB

END

Sample data...

NBR_SSN NAM_FIRST NAM_LAST DTE_DOB

------------------------------------------------------------

185900003 ROSE MARRY 1960-04-12

267500120 BETTY HARRIS 1943-06-28

621192106 SUSHIEN YONG 1979-09-11

346670122 DAVIS JOHN 1965-07-10

087605420 ADAMS WINSTON 1958-06-18

Here SSN format shud be 9 digits with prefilled zero's

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Roy Rogers SSCertifiable Points: 7712 More actions · Answer 1

Roy Rogers

SSCertifiable

Points: 7712

May 25, 2006 at 8:39 am

#640066

It should not give duplicates unless there is a different NAM_FIRST, NAM_LAST or DTE_DOB for a NBR_SSN for the same NBR_SSN.

Can you post the returned duplicate sample.

Regards,
gova

PW-201837 SSC-Insane Points: 20805 More actions · Answer 2

PW-201837

SSC-Insane

Points: 20805

May 25, 2006 at 8:58 am

#640077

And also please confirm that your "sample" data is "generated" data and does not reflect real person's SSNs and birth dates ...

Mike Levan SSChampion Points: 14201 More actions · Answer 3

123111111 ADGSDGDSGDSDG JDJDJNNEEJJSKLS 1987-11-11

201205455 MARGARET GUMP 1952-01-01

087605420 ADAMS WINSTON NULL

204685650 SADE ROSS 1987-11-18

345345345 RD D 1945-02-12

191200612 DENNIS LILLEE 1976-06-03

087605420 ADAMS WINSTON NULL

123111111 JKS EREE 1976-10-03

The problem is I should get unique ssn’s irrespective of first name, last name, DOB.

Even I get same SSN for different FNAME,LNAME,DOB .it is not considered. SSN’S SHOULD BE UNDUPLICATED.

thanks.

PW-201837 SSC-Insane Points: 20805 More actions · Answer 4

>>The problem is I should get unique ssn’s irrespective of first name, last name, DOB.

The problem is not what you should get. The problem is that you have coded your query to give duplicates if they exist because DISTINCT acts across the entire set of columns.

You don't want DISTINCT. You want arbitrary or rule-based culling of duplicates. In the case of SSN 123111111 in your sample, which do you want to keep ? The "first' one ? The "most complete" ? The most recently updated/entered ? What are the business rules ?

Mike Levan SSChampion Points: 14201 More actions · Answer 5

Mike Levan

SSChampion

Points: 14201

May 25, 2006 at 9:29 am

#640093

I dont mind which SSN shud be used, what i need is just UNIQUE SSN's thats all.

PW-201837 SSC-Insane Points: 20805 More actions · Answer 6

PW-201837

SSC-Insane

Points: 20805

May 25, 2006 at 9:35 am

#640096

>>what i need is just UNIQUE SSN's thats all.

Then take the other columns out of the resultset.

If the same SSN has 2 or more different names, that is a data issue that you *should* care about.

Mike Levan SSChampion Points: 14201 More actions · Answer 7

Mike Levan

SSChampion

Points: 14201

May 25, 2006 at 9:42 am

#640102

what do u mean by take the other columns out.

I want to display all fields with UNIQUE SSN.

though I am using DISTINCT for SSN still am getting repeated

PW-201837 SSC-Insane Points: 20805 More actions · Answer 8

>>though I am using DISTINCT for SSN still am getting repeated

You are missing the point. DISTINCT does not work on simply 1 column. It works across all columns in the resultset. Either take the other columns out and select only SSN, or come up with the business rules for which other columns you want when a SSN is duplicated.

Vladan SSC-Insane Points: 22072 More actions · Answer 9

Your problem is not in SQL, your problem is in unclear definition of required result.

If there are 2 rows with the same SSN, which of them do you want to return?

123111111 ADGSDGDSGDSDG JDJDJNNEEJJSKLS 1987-11-11

123111111 JKS EREE 1976-10-03

Both these rows have the same SSN. You want to return SSN 123111111 only once - but at the same time you want to return names. Which of the names do you want, if you can only return one? That's what PW was trying to explain... solution depends on requirements.

Another possibility is that we misunderstood your question. If you think we did, please explain the whole thing in more detail.

rduniway SSC-Addicted Points: 409 More actions · Answer 10

While on the topic of business requirements, who is responsible for the source data and would they want to clean up records where there appears to be either incorrect ssn or correct ssn but two different names or birthdates associated with the same ssn? if app is an abbreviation for application I'm guessing the answer if probably that they need to converge on a single version of this individual by the time an application is accepted. This is a common administrative issue, and you might discover that the solution is to do something like take the minimum value of each field (using group by ssn instead of distinct to eliminate duplicate), but then run a separate report report which identifies duplicates and feed that report back to the office responsible for entering and managing these applications so they can decide which values are correct (and alert the INS if two people are using the same fake ssn).

To do this you could run something like

Select ssn, Min(LastName) as LastName, Min(FirstName) as First Name, Min(DOB) as DOB

From SourceTable

Group By ssn

This will give you one record per ssn.

To find the records that have more than one value you could run

Select *

From SourceTable

Where snn IN (Select ssn From SourceTable Group By ssn Having Count(*) > 1)

Order By ssn

Does any of that make sense in your situation?

Bob

PW-201837 SSC-Insane Points: 20805 More actions · Answer 11

>>To do this you could run something like

>>Select ssn, Min(LastName) as LastName, Min(FirstName) as First Name, Min(DOB) as DOB

I've seen this done often to eliminate dupes and it is incorrect, because it brings together unrelated attributes. What if John Smith and William Anderson have the same SSN ?

Your query will show John Anderson as the SSN owner which is just as incorrect and meaningless as having both with the same SSN.

rduniway SSC-Addicted Points: 409 More actions · Answer 12

Actually, when I looked at my second suggestion I realized it was incomplete. You don't care if they have more than one application as long as they give the same personal information. So you could do the following to find cases where ssn is attached to differering personal information.

Select Distinct ssn, LastName, FirstName, DOB

From SourceTable

Where ssn IN (Select S.ssn

From (Select Distinct ssn, LastName, FirstName, DOB From Source Table) S

Group By S.ssn

Having Count(*) > 1)

Order by ssn

What this does is counts the number of ssn values from an embedded subquery that counts distinct rows of {ssn, LastName, FirstName, DOB} and returns the ssn of cases where there is more than one such row. The outer query then returns the details.

Hope this is useful.

rduniway SSC-Addicted Points: 409 More actions · Answer 13

Good point. But read my entire post. I'm trying to suggest possible tools for getting out a report while also getting his to tackle the real issue which is that there shouldn't be differing names associated with the same ssn in the source system.

What this person seems to actually want is an unduplicated list of the applicants from whom apps are received.. If there is bad data in the source system his report will be incorrect regardless of what he does. So I'm suggesting separating the issue of getting a distinct count from the issue of identifying and cleaning up bad source data. I'm not in any way meaning to suggest that we know the Min values are the correct values, but if there is an operational need to know how many distinct applicants they have using Min would make sense. If they are preparing a mailing instead it wouldn't make senes, but he wasn't pulling address data so I asssume to goal is more likely to get a count of unique applicants. Of course if he is really after a count he could drop the names and DOB out of his query as was already suggested, but then no one would notice and clean up the problems in the source system.

I agree with the comments that he first needs to be clearer about the business requirements, and was simply trying to suggest ways in which he could address a couple likely businesss issues. Given where he was coming from I thought I'd stick as close to how he saw the problem as possible and see if he could take the next step. After all, this is a newbie post, right? Maybe he doesn't know how to build relationships with the source system owners and get them to clean up the source data yet. I'm trying to help him head in that direction. Maybe he doesn't realize that you can use Group By and Min or Max to build a data table that populates value for every column and creates an unduplicated key column. These are useful tools to pick up on the way towared learning how to solve data problems in T-SQL.