How to performance tune this Merge statement for our DW ETL process?

  • Hi all,

    I have this relatively simple (logic-wise) MERGE statement that we run every day to update the data warehouse's dimension and fact tables from the raw ODS table, but it's taking so long that I'd like some fresh ideas on how to tune it.

    The general shape of the MERGE statement is simply:

    MERGE ZooskDataWarehouse.dbo.DimUser AS target
    USING OdsUsers AS source
        ON target.UserID = source.UserID
    WHEN MATCHED AND target.LastUpdated < source.LastUpdated THEN
        UPDATE SET xxxxxx
    WHEN NOT MATCHED BY TARGET THEN
        INSERT XXXXX;

    Basically, each day production (the web site) sends the DW (data warehouse) all the users who have updated their accounts. I then merge all of that data into our ODS table (which stores the user data in its raw form).

    Then I merge the ODS into both DimUser (the user dimension) and FactUser (the user fact table).

    The logic is quite simple; the problem is the volume.

    We have about 120 million rows of user data (so roughly 120MM rows in each of the ODS, DIM, and FACT user tables).

    Each day, about 7 million rows of user data change, and most of them are modifications to existing users that go through the UPDATE path.

    The SQL itself takes about an hour to run, even on fast CPUs and SSD RAIDed disks, and that is actually about 1/3 of the overall process time.

    How can I make it more efficient?

    What should my clustered index be? (Right now the clustered index is just UserID, plus a non-clustered index on UserID and LastUpdated.)

    Should I set up partitioned tables? (The problem is that any user has a chance of being updated, so you can't really partition by time.)

    I already have the database set up with multiple files/filegroups.

    During the ETL, I can see that CPU and memory utilization is very low. There is disk activity, but I can't say it's peaking either, so I suspect there's a lot of room for improvement in terms of efficiency.

    Another thought is to put a WHERE clause on the SOURCE part of the MERGE, for example (SELECT * FROM OdsUsers WHERE LastUpdated > GETDATE() - 1).

    Logically it makes sense, because then I am only merging the newly updated users into the dimension. However, it actually turns out to be slow, mostly because that WHERE clause is so inefficient.

    I also thought about putting the clustered index on LastUpdated so that SELECT would be quick, but index maintenance becomes very inefficient because of all the data shuffling when rows are updated.

    Any suggestions on how I can do better? Thanks.

    Without the query plan it's hard to give firm advice.

    I would definitely try reducing the rows going into the MERGE statement:

    ;WITH UpdatedOdsUsers AS
    (
        SELECT *
        FROM OdsUsers
        WHERE LastUpdated > DATEADD(dd, -1, GETDATE())
    )
    MERGE ZooskDataWarehouse.dbo.DimUser AS target
    USING UpdatedOdsUsers AS source
        ON target.UserID = source.UserID
    WHEN MATCHED AND target.LastUpdated < source.LastUpdated THEN
        UPDATE SET xxxxxx
    WHEN NOT MATCHED BY TARGET THEN
        INSERT XXXXX;

    You may also want to split the MERGE back into a two-step process, as it's easier to tackle the performance of each step individually.

    MERGE is handy, but hard to tune.

    INSERT DimUser (UserId, c1, c2, c3)
    SELECT UserId, '', '', 0  -- defaults!
    FROM OdsUser
    WHERE UserId NOT IN (SELECT UserId FROM DimUser);
    -- Optimize the insert by having non-clustered indexes on UserId on both tables.
    -- By inserting default values, IO reads are greatly reduced.

    UPDATE du
       SET XXXXXX
      FROM DimUser du
     INNER JOIN OdsUser ou ON ou.UserId = du.UserId
     WHERE du.LastUpdated < ou.LastUpdated
       AND ou.LastUpdated > DATEADD(dd, -1, GETDATE());

    How many other indexes are on these tables? It may be quicker to drop (or disable) and recreate the non-clustered indexes around the load rather than leaving them in place; a rough sketch follows below.
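    Purely as a sketch of that idea (the index name here is hypothetical, not one of your actual indexes): disable the extra non-clustered indexes before the load and rebuild them afterwards, so the load doesn't pay for index maintenance row by row.

    -- Sketch only: IX_DimUser_SomeOtherIndex is a made-up index name.
    ALTER INDEX IX_DimUser_SomeOtherIndex ON dbo.DimUser DISABLE;

    -- ... run the INSERT / UPDATE steps above ...

    -- Rebuilding re-enables the index and recreates it in a single pass.
    ALTER INDEX IX_DimUser_SomeOtherIndex ON dbo.DimUser REBUILD;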

    If you can post the plan, it would really help.

    thanks!

    120 million rows is quite a large number. Have you ever thought of partitioning the table?

    Say, for example, you have around 8 partitions. Then you could start 8 sessions, with each session inserting its share (roughly 900K rows) into its own partition based on the partition key.

    You could even drop indexes to load faster; recreating indexes on partitioned tables should be quick.

    But again, check whether partitioning suits your workload and actually provides the performance you are looking for. A rough layout is sketched below.
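    A minimal sketch of what partitioning by UserID might look like, assuming made-up boundary values, made-up object names, a single filegroup, and an existing plain clustered index (not a PRIMARY KEY constraint) named CIX_DimUser_UserID:

    -- Sketch only: boundary values, names, and the single filegroup are assumptions.
    CREATE PARTITION FUNCTION pfUserId (INT)
        AS RANGE RIGHT FOR VALUES (15000000, 30000000, 45000000,
                                   60000000, 75000000, 90000000, 105000000); -- 8 partitions

    CREATE PARTITION SCHEME psUserId
        AS PARTITION pfUserId ALL TO ([PRIMARY]);

    -- Rebuild the existing clustered index on the partition scheme
    -- so DimUser is physically partitioned by UserID.
    CREATE UNIQUE CLUSTERED INDEX CIX_DimUser_UserID
        ON dbo.DimUser (UserID)
        WITH (DROP_EXISTING = ON)
        ON psUserId (UserID);

    Each loader session could then work on its own UserID range, so the sessions don't contend for the same partition.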

    GulliMeel

    Finding top n Worst Performing queries
    Improve the performance of Merge Join (special case)
    How to Post Performance Problems - Gail Shaw

    You'd need to show the actual execution plan in order to make any guesses at what's needed to tune the query. One suggestion: see if there are ways to limit the initial data set to only the data that has been touched over the preceding 24 hours. Then the MERGE will be working with less data, and that'll help. Otherwise, no guesses without some idea of the structure and how the query is executing.

    ----------------------------------------------------
    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." - Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • The problem very likely has nothing to do with the number of rows being compared. It's probably the number of rows being updated.

    I don't have a definitive test to show you, but I ran one in the past. On my test box, it took scant seconds (about 4) to update a million rows in a much larger table. When I doubled the number of rows, it only doubled the amount of time, just as expected. But when I tripled the number of rows, instead of taking only 3 times longer (12 seconds in this case), it took more than 20 minutes.

    I call these things "tipping points" and every server has them. The size may vary, but they all have them.

    My recommendation would be to break up the MERGE so that you're only working with a million rows or so at a time. Seven updates of a million rows each, totaling maybe 28 seconds (a guess, for sure), is going to be a lot better than your hour-long run; see the batching sketch below.
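    A minimal sketch of one way to batch the UPDATE side of the split (column names and the 1,000,000 batch size are placeholders, not the real schema):

    -- Sketch only: SomeColumn and the batch size are assumptions.
    DECLARE @rows INT = 1;

    WHILE @rows > 0
    BEGIN
        UPDATE TOP (1000000) du
           SET du.SomeColumn  = ou.SomeColumn,   -- replace with the real column list
               du.LastUpdated = ou.LastUpdated   -- keeping LastUpdated current lets the loop converge
          FROM dbo.DimUser du
          JOIN dbo.OdsUsers ou ON ou.UserID = du.UserID
         WHERE du.LastUpdated < ou.LastUpdated;

        SET @rows = @@ROWCOUNT;   -- stop once no stale rows are left
    END;

    Because each batch brings the updated rows' LastUpdated in line with the ODS, those rows drop out of the WHERE clause and the loop works through the 7 million changed rows a million at a time.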

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • You say ETL... are you using SSIS for the ETL?

    If so, you can sometimes outperform MERGE by using the constructs within SSIS to mimic a merge:

    Use a lookup to see if the record already exists... if it doesn't, do an insert.

    Do a second lookup to see if the record has changed... if it has, do an update.

    Otherwise, just toss out the record.

    With my larger DW tables (250 million rows), that approach seems to give better performance than the MERGE. I suspect memory pressure, but our DW server is always under memory pressure these days (looking forward to our nice new upgrade). A rough T-SQL analogue of the lookup pattern is sketched below.
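    For readers not on SSIS, the same lookup / conditional-split idea can be expressed roughly in plain T-SQL (column names here are placeholders, and this mirrors the two-step split posted earlier in the thread):

    -- Rough T-SQL analogue of the SSIS lookup pattern; column names are placeholders.
    -- 1) Lookup miss: the user doesn't exist yet, so insert it.
    INSERT dbo.DimUser (UserID, LastUpdated /*, other columns */)
    SELECT ou.UserID, ou.LastUpdated /*, other columns */
      FROM dbo.OdsUsers ou
     WHERE NOT EXISTS (SELECT 1 FROM dbo.DimUser du WHERE du.UserID = ou.UserID);

    -- 2) Lookup hit and the row has changed: update it.
    UPDATE du
       SET du.LastUpdated = ou.LastUpdated /*, other columns */
      FROM dbo.DimUser du
      JOIN dbo.OdsUsers ou ON ou.UserID = du.UserID
     WHERE du.LastUpdated < ou.LastUpdated;

    -- 3) Lookup hit and nothing changed: the row is simply left alone.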



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • mtassin (5/18/2012)


    With my larger DW tables (250 Million rows) that seems to give better performance than the merge...

    A race would be interesting.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions
