inner join issue

  • I have written the follow query in sql server to match nhs numbers to various patients in a table, zzWembley WIC 2nd QTR 2008 table

    SELECT DISTINCT a.NHS_NUMBER, a.DATE_OF_BIRTH, a.POSTCODE, a.FORENAME,

    a.SURNAME, b.[Name]

    FROM NSTS_PATIENT a INNER JOIN [zzWembley WIC 2nd QTR 2008] b

    ON a.DATE_OF_BIRTH = b.[Date Of Birth] AND a.POSTCODE = b.[Post Code]

    Whilst this query matches some patients I think I may be missing a few records because the unique identifier for each patient is: [name][Date of Birth] and [Post code]. However I can not add name as a criteria for ON-Clause because of 2 reasons. Firstly the [name] field in one table combines both first and second name whilst the other has one field for forename and another field for surname. Secondly the names may be spelt incorrectly despite being the same patient. How can overcome this problem.

  • Do you not have any unique identification field for the patients? Usually in cases such as these, you would have a unique identifier such as PatientID, or PatientNumber that would be the same across tables.

    If the answer to that is no, and your data is bad in the only fields that identify the patients, then you need to fix your data. Fixing data like this is usually done i several passes, and for sensitive data such as yours might have to get actual records checked. I'd start by grouping by postal code and DOB, and showing anything with more than one FirstName / Lastname combination over the tables. I'd then do WHERE NOT IN searches both ways on patient names, with and without linking criteria. Basically, do anything you can think of to possibly show where the data might be different. Don't ignore postal codes and DOB's once you fix all the names, because those being wrong will cause the same issues. If you can get this data all cleaned up, you should:

    A. Put policies in place to prevent people from putting bad data into the tables

    B. Add a unique identifier for each patient so that you have a way to link them even if the name is wrong (preferred).

    C. If you can't do either of the first 2, set up all of the checks you did to find all the bad data in the first place in a task which runs every day and sends you an email, or inserts bad lines into a table, or sets a flag on the table... something to know when a row is possibly bad so it can be corrected before your data gets heavily corrupted again.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • It is just a once of thing that i was given to do. I imported the data from an excel file. The reference table contains the correct data i.e names and nhs numbers. However I didn't quite get what you meant by :

    grouping by postal code and DOB, and showing anything with more than one FirstName / Lastname combination over the tables.

    I'd then do WHERE NOT IN searches both ways on patient names, with and without linking criteria.

    Could you just explain that with a brief example. Sorry to be a pain in the arse.

  • I'm referring to basic data cleansing. For example

    The first query will show you that Michael Jordan has a name Mismatch with Mike Jordan based on DOB and Zip Code. This may be a name mismatch, or it may simply be another patient with matching DOB and Zip codes. The frequency of the latter should be low. The second query will show you which patients have names / zip codes in common with different DOB's.

    [font="Courier New"]------------------- SETUP ---------------------------------

    DECLARE @A TABLE(

    Name   VARCHAR(40),

    Zip    INT,

    DOB    DATETIME)

    DECLARE @B TABLE(

    FirstName  VARCHAR(20),

    LastName   VARCHAR(20),

    Zip    INT,

    DOB    DATETIME)

    INSERT INTO @A(Name, Zip, DOB)

    SELECT 'Michael Jordan',12345,'12/05/68' UNION ALL

    SELECT 'Magic Johnson',12543,'12/05/69' UNION ALL

    SELECT 'Larry Bird',12243,'12/05/70'

    INSERT INTO @B(FirstName, LastName, Zip, DOB)

    SELECT 'Mike','Jordan',12345,'12/05/68' UNION ALL

    SELECT 'Magic','Johnson',12543,'12/05/69' UNION ALL

    SELECT 'Larry','Bird',12243,'12/05/71'

    -----------------------------------------------------------

    ------------- Find Mismatched Names -----------------------

    SELECT A.Name, B.Name

    FROM @A A

       INNER JOIN (  SELECT FirstName + ' ' + LastName Name, Zip, DOB

               FROM @B) B

           ON A.Zip = B.Zip AND A.DOB = B.DOB

    WHERE A.Name <> B.Name

    GROUP BY A.Name, B.Name

    -------------------- RESULTS ------------------------------

    --A.Name B.Name

    --Michael Jordan Mike Jordan

    -----------------------------------------------------------

    ------------- Find Mismatched DOB's -----------------------

    SELECT A.Name, A.DOB, B.Name, B.DOB

    FROM @A A

       INNER JOIN (  SELECT FirstName + ' ' + LastName Name, Zip, DOB

               FROM @B) B

           ON A.Zip = B.Zip AND A.Name = B.Name

    WHERE A.DOB <> B.DOB

    GROUP BY A.Name, B.Name, A.DOB, B.DOB

    -------------------- RESULTS ------------------------------

    --A.Name A.DOB B.Name B.DOB

    --Larry Bird 1970-12-05 00:00:00.000 Larry Bird 1971-12-05 00:00:00.000

    -----------------------------------------------------------

    [/font]

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Thank you very much. I think i should be able to work it from there. That is makes a lot of sense.

  • No problem, glad we could help.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Just one more thing. I was just playing around with script similar to yours for finding mismatched names. when I use the GROUP BY clause I get 1491 records. When I leave it out I get 1495 records. I just wanted to know why the GROUP BY clause did not return the extra 4 records. What specfically is the GROUP BY clause doing.

  • Group by is normally used for aggregates. To be honest, I don't need it in those queries, I just wrote them in a hurry and had originally planned on doing a count. I deleted the count at the last minute and forgot to remove the group by. In your instance, group by is basically removing your duplicates.

    Example of how it works:

    Customer Sales

    1 10.00

    1 20.00

    2 25.00

    2 10.00

    SELECT Customer, SUM(Sales)

    FROM MyTable

    GROUP BY Customer

    RESULT

    Customer Sales

    1 30.00

    2 35.00

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply