Using EXCEPT to Determine Row Differences

Question

Post reply

Using EXCEPT to Determine Row Differences

Viewing 15 posts - 46 through 60 (of 72 total)

You must be logged in to reply to this topic. Login to reply

Christian Graus SSC Eights! Points: 963 More actions · Answer 1

Christian Graus

SSC Eights!

Points: 963

April 2, 2014 at 7:15 pm

#1702694

I was going to report the same thing. The article should be updated to add rows to both tables, and to explain the usefulness of the joins, in restricting the except to only compare columns that exist ( as defined by the primary key ) in both tables.

simon.coleman SSC Enthusiast Points: 161 More actions · Answer 2

Hi,

I'm interested to learn whether this generates a high or low-cost queryplan?

have you been comparing the entire dataset with a fresher copy each time, or just checking against a newly loaded subset?

I'd guess that performance ought to be somewhat similar to join performance which forces a table lookup on both sides [to get the column data] ...

clustering both tables over the pkey should be quickest, since ultimately it will have to check all the data across both tables.

what's your performance experience been like?

william.polewchak SSC Rookie Points: 37 More actions · Answer 3

william.polewchak

SSC Rookie

Points: 37

January 9, 2015 at 5:24 am

#1769034

To echo what other posters have said, this is a HUGE win, when comparing data. There have been countless times where I need to do exactly what you're describing, and this method is so unbelievably simple, it made my day.

Cheers!

Stephen Hirsch SSCommitted Points: 1822 More actions · Answer 4

Just wanted to share this trick that I've used with a lot of success over the years:

[font="Courier New"]

SELECT 'In A not B' location,

f1,

f2,

f3

FROM a

EXCEPT

SELECT 'In A not B' location,

f1,

f2,

f3

FROM B

UNION ALL

SELECT 'In B not A' location,

f1,

f2,

f3

FROM B

EXCEPT

SELECT 'In B not A' location,

f1,

f2,

f3

FROM A

UNION ALL

(

SELECT 'In Both' location,

f1,

f2,

f3

FROM B

INTERSECT

SELECT 'In Both' location,

f1,

f2,

f3

FROM A

)

[/font]

tom.w.brannon Hall of Fame Points: 3956 More actions · Answer 5

I have three tables I sometimes need to sync (development, acceptance and production) and use the following sort of query to find all differences. This is not quite what the author wanted which is differences for rows in both/all tables but could easily be modified to do so. Also note that my tables are relatively small and I have not had to worry about performance. Here is a query that compare constraints column usage across instances:

select sum(src) as srcnum,

case sum(src) when 1 then 'D__'

when 2 then '_A_'

when 3 then 'DA_'

when 4 then '__P'

when 5 then 'D_P'

when 6 then '_AP'

when 7 then 'DAP' end as src,

a.TABLE_SCHEMA, a.TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

from (

select 1 src, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

from Dev.information_schema.CONSTRAINT_COLUMN_USAGE

union

select 2 src, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

from Acc.information_schema.CONSTRAINT_COLUMN_USAGE

union

select 4 src, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

from Pro.information_schema.CONSTRAINT_COLUMN_USAGE

) a

group by data_grp, a.TABLE_SCHEMA, a.TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

having sum(src) < 7

order by data_grp, a.TABLE_SCHEMA, a.TABLE_NAME, COLUMN_NAME, CONSTRAINT_NAME, sum(src);

For data table I typically simplify the inner union queries as 'select 1 src, *'. I know use of * is often discouraged. However, if it bombs because my columns are not in the same order or of different data types, that can be useful information as well.

tspade SSC Veteran Points: 264 More actions · Answer 6

victordba (3/26/2014)

Thank you, @james.jensen1350 for your reply.

I've been using MERGE for many years to do the upsert. And my source and target tables are large, with millions of records. So, MERGE takes it's sweet time to compare large tables to find the differences between those tables and to do upsert. But MERGE is not perfect, and you have to be very careful with it and also not to forget to take case of NULLs in every field. So, I am always on the lookout for some interesting and more efficient ways to do my upserts. And this article got my attention. I did some quick test comparisons between MERGE and EXCEPT to find changes between 2 tables with about 10 million records in each, and EXCEPT took about 35 seconds, while MERGE about 7 minutes. But I just ran EXCEPT once, to find changes in matching records. I would have to do a full test to also find new and missing records, so I would have to run EXCEPT 2 more times, and also run INSERT, UPDATE and DELETE commands when I find those differences. So, more results to follow. I was just wondering if anyone else tried some other ways to do UPSERTs besides MERGE command on large tables.

@victordba - Did you ever prove whether the all-inclusive MERGE statement performed better or worse than all the separate statements needed when using the EXCEPT approach?

sqlr Grasshopper Points: 22 More actions · Answer 7

Regarding the query optimization, only the first join is needed, it ensures you are only taking the Source items that are also found on Target. The second one is not necessary.

Actually, on this situation you could even avoid using joins and table aliases:

SELECT ID

, [Item]

, [Price]

, [OrderDate]

, [Units]

, [ShipmentDate]

FROM [#Source]

WHERE [ID] IN (SELECT [ID] FROM #Target)

EXCEPT

SELECT [ID]

, [Item]

, [Price]

, [OrderDate]

, [Units]

, [ShipmentDate]

FROM [#Target]

RonKyle SSC-Dedicated Points: 31482 More actions · Answer 8

I don't know that where I could use the MERGE I would use EXCEPT. I've used EXCEPT a couple of times on permanent queries. The first was to take a list of thousands of phone numbers and add only the new ones to a list containing millions of phone numbers. The insert performed so quickly that I saw no need to try to find a different way to do it.

I used EXCEPT to do the same thing and also found it very fast. This produced a source recordset that was then inserted into the table. As I only needed to insert as you can't update a phone number (a person's phone number yes, but not an actual phone number) I can't think that a MERGE would provide any advantage.

david 20841 Newbie Points: 3 More actions · Answer 9

I have been using EXCEPT in ETL MERGE statements because it handles NULL values without the need for picking a suitable COALESCE default (TIME datatypes ?). I join the source and target rows using their primary key, and perform an EXISTS on the result of the EXCEPT statement between source and target. However I have found that for large tables, EXCEPT becomes less efficient. I can only assume this is because EXCEPT always compares all the column pairs in the the two rows, whereas I suspect individual ...OR src.ColA <> tgt.ColA... comparisons can return a result at the first TRUE condition without making any further comparisons.

barry.mcconnell Default port Points: 1447 More actions · Answer 10

Very nice. By using LEFT JOIN rather than INNER, the example also returns the unmatched rows.

DELETE FROM #Target WHERE Item = 'Fig';



GO

INSERT INTO #Source

        ( [Item], [Price], [OrderDate], [Units], [ShipmentDate] )

    VALUES  ( 'Prune', 2.49, '1/1/2001', NULL, '1/02/2001' );

GO

SELECT .[ID]

      , .[Item]

      , .[Price]

      , .[OrderDate]

      , .[Units]

      , .[ShipmentDate]

    FROM [#Source] 

        LEFT JOIN [#Target] [T]

        ON .[ID] = [T].[ID]

EXCEPT

SELECT [T].[ID]

      , [T].[Item]

      , [T].[Price]

      , [T].[OrderDate]

      , [T].[Units]

      , [T].[ShipmentDate]

    FROM [#Target] [T]

        LEFT JOIN [#Source] 

        ON [T].[ID] = .[ID];

returns 5 rows rather than the 3 from the example.

There are no facts, only interpretations.
Friedrich Nietzsche

h.tobisch SSCommitted Points: 1671 More actions · Answer 11

h.tobisch

SSCommitted

Points: 1671

January 9, 2015 at 8:15 am

#1769099

I use this approach :

MERGE statement and comparing columns with SET ANSI NULLS OFF.

Any objections against using SET ANSI NULLS OFF ?

TheGreenShepherd SSCertifiable Points: 6979 More actions · Answer 12

TheGreenShepherd

SSCertifiable

Points: 6979

January 9, 2015 at 9:22 am

#1769129

What do you think about using this in a trigger to find what has actually changed during a DML statement?

t.ovod-everett Default port Points: 1407 More actions · Answer 13

I scanned through the articles, but I may have missed it if someone discussed the "proper" way to do upserts with MERGE. I recommend http://www.made2mentor.com/2013/05/writing-t-sql-merge-statements-the-right-way/ , and for a lot of interesting details on query processor operators that aren't explicitly available in T-SQL you should read http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx . The upshot is that NOT EXISTS (SELECT S.* INTERSECT T.*) is a good way to check for row equivalence in a MERGE.

When making changes to stored procedures or complicated queries, I frequently want to ensure that I'm not changing the result sets. In these situations, I'm not worried about duplicate rows, so EXCEPT works well. I generally use the following syntax for these sort of "diff" queries:



WITH

DSO AS (

	SELECT *

	FROM #Old

),

DSN AS (

	SELECT *

	FROM #New

)

SELECT 'Old' AS Tbl, * FROM (SELECT * FROM DSO EXCEPT SELECT * FROM DSN) AS U1

UNION

SELECT 'New' AS Tbl, * FROM (SELECT * FROM DSN EXCEPT SELECT * FROM DSO) AS U2

ORDER BY foo, Tbl;

I use the above structure because if I need to ignore certain columns, I can do so in one place in the CTEs rather than having to repeat myself in the pair if EXCEPT sub-queries. In the ORDER BY, "foo" should be replaced by the candidate key columns to get changes to pair up nicely.

cwhite-735073 Valued Member Points: 59 More actions · Answer 14

cwhite-735073

Valued Member

Points: 59

January 9, 2015 at 5:09 pm

#1769218

Does anyone have any information on the performance impact of using this method to create type 2 slowly changing dimensions versus using a merge statement?

DataTherapist Mr or Mrs. 500 Points: 510 More actions · Answer 15

I'm with the others that don't see a need for the join

the query

SELECT * FROM dbo.source

except

SELECT * FROM dbo.target

gives me the same results! so? ...

I was told to add a row to Source and try again...

the extra row shows up in the results.... that is what you would want, yes/no?

ID Item Price OrderDate Units ShipmentDate

2 Coconut 0.99 2003-03-03 1.3500 NULL

3 Eggplant 22.44 2005-05-05 NULL 2005-05-06

6 Lychee 0.29 2012-12-12 1000.5500 2012-12-14

7 GrApple 2.22 2001-01-01 NULL 2001-01-02 <--this row in Source, that does not exist in Target showed up.

So the inner join, does remove the GrApple row.

example:

SELECT * FROM dbo.source INNER JOIN dbo.target ON dbo.source.ID=dbo.target.id

except

SELECT * FROM dbo.target INNER JOIN dbo.source ON dbo.source.ID=dbo.target.id

returns just the 3 rows:

2 Coconut 0.99 2003-03-03 1.3500 NULL

3 Eggplant 22.44 2005-05-05 NULL 2005-05-06

6 Lychee 0.29 2012-12-12 1000.5500 2012-12-14

so the point is that you are only looking for rows that are DIFFERENT, that should be the same, excluding rows with new ID's which have been produced in the production environment.... but you are still able to verify that your copy is accurate, as far as what you have.....

yes?

Thanks!

Peter A.R. Johnson

DBA / Data Therapist / Federal Way Public Schools