inner join issue

Question

Post reply

inner join issue

eseosaoregie

SSCertifiable

Points: 6889
More actions
November 6, 2008 at 5:21 am

#72977

I have written the follow query in sql server to match nhs numbers to various patients in a table, zzWembley WIC 2nd QTR 2008 table

SELECT DISTINCT a.NHS_NUMBER, a.DATE_OF_BIRTH, a.POSTCODE, a.FORENAME,

a.SURNAME, b.[Name]

FROM NSTS_PATIENT a INNER JOIN [zzWembley WIC 2nd QTR 2008] b

ON a.DATE_OF_BIRTH = b.[Date Of Birth] AND a.POSTCODE = b.[Post Code]

Whilst this query matches some patients I think I may be missing a few records because the unique identifier for each patient is: [name][Date of Birth] and [Post code]. However I can not add name as a criteria for ON-Clause because of 2 reasons. Firstly the [name] field in one table combines both first and second name whilst the other has one field for forename and another field for surname. Secondly the names may be spelt incorrectly despite being the same patient. How can overcome this problem.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Garadin One Orange Chip Points: 29613 More actions · Answer 1

Do you not have any unique identification field for the patients? Usually in cases such as these, you would have a unique identifier such as PatientID, or PatientNumber that would be the same across tables.

If the answer to that is no, and your data is bad in the only fields that identify the patients, then you need to fix your data. Fixing data like this is usually done i several passes, and for sensitive data such as yours might have to get actual records checked. I'd start by grouping by postal code and DOB, and showing anything with more than one FirstName / Lastname combination over the tables. I'd then do WHERE NOT IN searches both ways on patient names, with and without linking criteria. Basically, do anything you can think of to possibly show where the data might be different. Don't ignore postal codes and DOB's once you fix all the names, because those being wrong will cause the same issues. If you can get this data all cleaned up, you should:

A. Put policies in place to prevent people from putting bad data into the tables

B. Add a unique identifier for each patient so that you have a way to link them even if the name is wrong (preferred).

C. If you can't do either of the first 2, set up all of the checks you did to find all the bad data in the first place in a task which runs every day and sends you an email, or inserts bad lines into a table, or sets a flag on the table... something to know when a row is possibly bad so it can be corrected before your data gets heavily corrupted again.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

eseosaoregie SSCertifiable Points: 6889 More actions · Answer 2

It is just a once of thing that i was given to do. I imported the data from an excel file. The reference table contains the correct data i.e names and nhs numbers. However I didn't quite get what you meant by :

grouping by postal code and DOB, and showing anything with more than one FirstName / Lastname combination over the tables.

I'd then do WHERE NOT IN searches both ways on patient names, with and without linking criteria.

Could you just explain that with a brief example. Sorry to be a pain in the arse.

Garadin One Orange Chip Points: 29613 More actions · Answer 3

I'm referring to basic data cleansing. For example

The first query will show you that Michael Jordan has a name Mismatch with Mike Jordan based on DOB and Zip Code. This may be a name mismatch, or it may simply be another patient with matching DOB and Zip codes. The frequency of the latter should be low. The second query will show you which patients have names / zip codes in common with different DOB's.

[font="Courier New"]------------------- SETUP ---------------------------------



DECLARE @A TABLE(

Name   VARCHAR(40),

Zip    INT,

DOB    DATETIME)

DECLARE @B TABLE(

FirstName  VARCHAR(20),

LastName   VARCHAR(20),

Zip    INT,

DOB    DATETIME)

INSERT INTO @A(Name, Zip, DOB)

SELECT 'Michael Jordan',12345,'12/05/68' UNION ALL

SELECT 'Magic Johnson',12543,'12/05/69' UNION ALL

SELECT 'Larry Bird',12243,'12/05/70'

INSERT INTO @B(FirstName, LastName, Zip, DOB)

SELECT 'Mike','Jordan',12345,'12/05/68' UNION ALL

SELECT 'Magic','Johnson',12543,'12/05/69' UNION ALL

SELECT 'Larry','Bird',12243,'12/05/71'

-----------------------------------------------------------

------------- Find Mismatched Names -----------------------

SELECT A.Name, B.Name

FROM @A A

   INNER JOIN (  SELECT FirstName + ' ' + LastName Name, Zip, DOB

           FROM @B) B

       ON A.Zip = B.Zip AND A.DOB = B.DOB

WHERE A.Name <> B.Name

GROUP BY A.Name, B.Name

-------------------- RESULTS ------------------------------

--A.Name		B.Name

--Michael Jordan	Mike Jordan

-----------------------------------------------------------

------------- Find Mismatched DOB's -----------------------

SELECT A.Name, A.DOB, B.Name, B.DOB

FROM @A A

   INNER JOIN (  SELECT FirstName + ' ' + LastName Name, Zip, DOB

           FROM @B) B

       ON A.Zip = B.Zip AND A.Name = B.Name

WHERE A.DOB <> B.DOB

GROUP BY A.Name, B.Name, A.DOB, B.DOB

-------------------- RESULTS ------------------------------

--A.Name	A.DOB			B.Name		B.DOB

--Larry Bird	1970-12-05 00:00:00.000	Larry Bird	1971-12-05 00:00:00.000

-----------------------------------------------------------

[/font]

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

eseosaoregie SSCertifiable Points: 6889 More actions · Answer 4

eseosaoregie

SSCertifiable

Points: 6889

November 6, 2008 at 8:09 am

#894910

Thank you very much. I think i should be able to work it from there. That is makes a lot of sense.

Garadin One Orange Chip Points: 29613 More actions · Answer 5

Garadin

One Orange Chip

Points: 29613

November 6, 2008 at 8:12 am

#894915

No problem, glad we could help.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

eseosaoregie SSCertifiable Points: 6889 More actions · Answer 6

eseosaoregie

SSCertifiable

Points: 6889

November 6, 2008 at 9:55 am

#895038

Just one more thing. I was just playing around with script similar to yours for finding mismatched names. when I use the GROUP BY clause I get 1491 records. When I leave it out I get 1495 records. I just wanted to know why the GROUP BY clause did not return the extra 4 records. What specfically is the GROUP BY clause doing.

Garadin One Orange Chip Points: 29613 More actions · Answer 7

Group by is normally used for aggregates. To be honest, I don't need it in those queries, I just wrote them in a hurry and had originally planned on doing a count. I deleted the count at the last minute and forgot to remove the group by. In your instance, group by is basically removing your duplicates.

Example of how it works:

Customer Sales

1 10.00

1 20.00

2 25.00

2 10.00

SELECT Customer, SUM(Sales)

FROM MyTable

GROUP BY Customer

RESULT

Customer Sales

1 30.00

2 35.00

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]