Duplicates

  • I am using the following script but am getting duplicate records for SSN. How can I get rid of duplicates for the field NBR_SSN

    BEGIN

    DECLARE @nbr_energy_year BIGINT

    SET @nbr_energy_year = 2004

    SELECT DISTINCT RIGHT('000000000' + CAST(app.NBR_SSN AS VARCHAR(9)),9) AS NBR_SSN,

    UPPER (app.NAM_FIRST)AS NAM_FIRST,

    UPPER (app.NAM_LAST) AS NAM_LAST,

    app.DTE_DOB

    FROM T_APPLN app WITH(NOLOCK)

    INNER JOIN T_STATUS_APPLN tsa WITH (NOLOCK)

    ON app.CDE_STATUS_APPLN = tsa.CDE_STATUS_APPLN

    WHERE app.NBR_ENERGY_YEAR = @nbr_energy_year

    AND app.CDE_STATUS_APPLN = 2

    AND app.CDE_ACTIVE_FLAG = 1

    GROUP BY app.NBR_SSN,

    app.NAM_FIRST,

    app.NAM_LAST,

    app.DTE_DOB

    END

     

    Sample data...

    NBR_SSN NAM_FIRST NAM_LAST DTE_DOB

    ------------------------------------------------------------

    185900003 ROSE MARRY 1960-04-12

    267500120 BETTY HARRIS 1943-06-28

    621192106 SUSHIEN YONG 1979-09-11

    346670122 DAVIS JOHN 1965-07-10

    087605420 ADAMS WINSTON 1958-06-18

    Here SSN format shud be 9 digits with prefilled zero's

  • It should not give duplicates unless there is a different NAM_FIRST, NAM_LAST or DTE_DOB for a NBR_SSN for the same NBR_SSN.

    Can you post the returned duplicate sample.

    Regards,
    gova

  • And also please confirm that your "sample" data is "generated" data and does not reflect real person's SSNs and birth dates ...

  • 123111111        ADGSDGDSGDSDG      JDJDJNNEEJJSKLS       1987-11-11

    201205455        MARGARET                  GUMP                          1952-01-01

    087605420        ADAMS                         WINSTON                     NULL

    204685650        SADE                           ROSS                           1987-11-18

    345345345        RD                                D                                  1945-02-12

    191200612        DENNIS                        LILLEE                          1976-06-03

    087605420        ADAMS                        WINSTON                     NULL

    123111111        JKS                              EREE                           1976-10-03       

     

    The problem is I should get unique ssn’s irrespective of first name, last name, DOB.

    Even I get same SSN for different FNAME,LNAME,DOB .it is not considered. SSN’S SHOULD BE UNDUPLICATED.

    thanks.

  • >>The problem is I should get unique ssn’s irrespective of first name, last name, DOB.

    The problem is not what you should get. The problem is that you have coded your query to give duplicates if they exist because DISTINCT acts across the entire set of columns.

    You don't want DISTINCT. You want arbitrary or rule-based culling of duplicates. In the case of SSN 123111111 in your sample, which do you want to keep ? The "first' one ? The "most complete" ? The most recently updated/entered ? What are the business rules ?

     

  • I dont mind which SSN shud be used, what i need is just UNIQUE SSN's thats all.

  • >>what i need is just UNIQUE SSN's thats all.

    Then take the other columns out of the resultset.

    If the same SSN has 2 or more different names, that is a data issue that you *should* care about.

     

     

  • what do u mean by take the other columns out.

    I want to display all fields with UNIQUE SSN.

    though I am using DISTINCT for SSN still am getting repeated

  • >>though I am using DISTINCT for SSN still am getting repeated

    You are missing the point. DISTINCT does not work on simply 1 column. It works across all columns in the resultset. Either take the other columns out and select only SSN, or come up with the business rules for which other columns you want when a SSN is duplicated.

     

  • Your problem is not in SQL, your problem is in unclear definition of required result.

    If there are 2 rows with the same SSN, which of them do you want to return?

    123111111        ADGSDGDSGDSDG      JDJDJNNEEJJSKLS       1987-11-11

    123111111        JKS                              EREE                           1976-10-03 

    Both these rows have the same SSN. You want to return SSN 123111111 only once - but at the same time you want to return names. Which of the names do you want, if you can only return one? That's what PW was trying to explain... solution depends on requirements.

    Another possibility is that we misunderstood your question. If you think we did, please explain the whole thing in more detail.

  • While on the topic of business requirements, who is responsible for the source data and would they want to clean up records where there appears to be either incorrect ssn or correct ssn but two different names or birthdates associated with the same ssn?  if app is an abbreviation for application I'm guessing the answer if probably that they need to converge on a single version of this individual by the time an application is accepted.  This is a common administrative issue, and you might discover that the solution is to do something like take the minimum value of each field (using group by ssn instead of distinct to eliminate duplicate), but then run a separate report report which identifies duplicates and feed that report back to the office responsible for entering and managing these applications so they can decide which values are correct (and alert the INS if two people are using the same fake ssn).

    To do this you could run something like

    Select ssn, Min(LastName) as LastName, Min(FirstName) as First Name, Min(DOB) as DOB

    From SourceTable

    Group By ssn

    This will give you one record per ssn.

    To find the records that have more than one value you could run

    Select *

    From SourceTable

    Where snn IN (Select ssn From SourceTable Group By ssn Having Count(*) > 1)

    Order By ssn

    Does any of that make sense in your situation?

     

    Bob

  • >>To do this you could run something like

    >>Select ssn, Min(LastName) as LastName, Min(FirstName) as First Name, Min(DOB) as DOB

    I've seen this done often to eliminate dupes and it is incorrect, because it brings together unrelated attributes. What if John Smith and William Anderson have the same SSN ?

    Your query will show John Anderson as the SSN owner which is just as incorrect and meaningless as having both with the same SSN.

     

  • Actually, when I looked at my second suggestion I realized it was incomplete.  You don't care if they have more than one application as long as they give the same personal information.  So you could do the following to find cases where ssn is attached to differering personal information.

     

    Select Distinct ssn, LastName, FirstName, DOB

    From SourceTable

    Where ssn IN (Select S.ssn

      From (Select Distinct ssn, LastName, FirstName, DOB From Source Table) S

           Group By S.ssn

           Having Count(*) > 1)

    Order by ssn

    What this does is counts the number of ssn values from an embedded subquery that counts distinct rows of {ssn, LastName, FirstName, DOB} and returns the ssn of cases where there is more than one such row.  The outer query then returns the details.

    Hope this is useful.

  • Good point.  But read my entire post.  I'm trying to suggest possible tools for getting out a report while also getting his to tackle the real issue which is that there shouldn't be differing names associated with the same ssn in the source system.

    What this person seems to actually want is an unduplicated list of the applicants from whom apps are received..  If there is bad data in the source system his report will be incorrect regardless of what he does.  So I'm suggesting separating the issue of getting a distinct count from the issue of identifying and cleaning up bad source data.  I'm not in any way meaning to suggest that we know the Min values are the correct values, but if there is an operational need to know how many distinct applicants they have using Min would make sense.  If they are preparing a mailing instead it wouldn't make senes, but he wasn't pulling address data so I asssume to goal is more likely to get a count of unique applicants.  Of course if he is really after a count he could drop the names and DOB out of his query as was already suggested, but then no one would notice and clean up the problems in the source system. 

    I agree with the comments that he first needs to be clearer about the business requirements, and was simply trying to suggest ways in which he could address a couple likely businesss issues.  Given where he was coming from I thought I'd stick as close to how he saw the problem as possible and see if he could take the next step.  After all, this is a newbie post, right?  Maybe he doesn't know how to build relationships with the source system owners and get them to clean up the source data yet.  I'm trying to help him head in that direction.  Maybe he doesn't realize that you can use Group By and Min or Max to build a data table that populates value for every column and creates an unduplicated key column.  These are useful tools to pick up on the way towared learning how to solve data problems in T-SQL.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply