LEAP and LAG behind the scenes

Question

Post reply

LEAP and LAG behind the scenes

rs80

SSCrazy

Points: 2312
More actions
December 23, 2015 at 7:50 am

#310436

I read an article that LEAP and LAG reduces the efforts of writing code to do a self join in order to get data from previous or subsequent records. However, behind the scenes, is it doing a self join?

Thanks!

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Sean Lange SSC Guru Points: 286562 More actions · Answer 1

rs80 (12/23/2015)

I read an article that LEAP and LAG reduces the efforts of writing code to do a self join in order to get data from previous or subsequent records. However, behind the scenes, is it doing a self join?

Thanks!

Not knowing what the source code looks like it is hard to say but it certainly isn't a self join. There are performance improvements using LEAD/LAG over the more traditional self join so it is obviously some other mechanism.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

rs80 SSCrazy Points: 2312 More actions · Answer 2

rs80

SSCrazy

Points: 2312

December 23, 2015 at 8:07 am

#1847755

Sorry, there isn't any source code that I have. It's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

rs80 (12/23/2015)

Sorry, there isn't any source code that I have. It's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.

Worktables (window spools).[/url]

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Sean Lange SSC Guru Points: 286562 More actions · Answer 4

rs80 (12/23/2015)

Sorry, there isn't any source code that I have. It's more of a theoretical question. What's the best way to learn how these functions work behind the scenes? I would think a SQL Server book would have more information.

Of course not, the source code is part of the engine. That was my point really. There are hundreds and hundreds of articles about LEAD/LAG. It seems like it gets the result set first and then applies the LEAD/LAG logic. That would make sense from a performance perspective and offers the benefits that a self join does not have because a self join has to look at the entire table instead of the result set of the query. There are plenty of others who know more than I do about this topic but I don't remember anywhere seeing anybody stating that "behind the scenes it is doing xxx".

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Steve Jones - SSC Editor SSC Guru Points: 728159 More actions · Answer 5

I would read about spools, but also perhaps pose the question to Itzik Ben-Gan. He may know.

For most of us, I'm not sure it matters how this is implemented. I'd guess some sort of hash join myself, but who knows what's really going on. You might be able to read how PostgreSQL does their implementation of windowing. I thought they had this in there. My guess is there aren't a lot of algorithmic ways to do this efficiently.

Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 6

I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 7

Jeff Moden (12/23/2015)

I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:

Jeff...if I recall someone had a series of scripts that compared LAG to QU?

was it you or do you have a bookmark?

would like to read it again.

thanks

edit...this is the one I was referring to:

http://blog.waynesheffield.com/wayne/archive/2011/08/running-totals-in-denali-ctp3/

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 8

J Livingston SQL (12/23/2015)

Jeff Moden (12/23/2015)

I don't know what goes on behind the scenes but I do know that, whatever they did, it makes for relatively slow running totals. :crazy:

Jeff...if I recall someone had a series of scripts that compared LAG to QU?

was it you or do you have a bookmark?

would like to read it again.

thanks

edit...this is the one I was referring to:

http://blog.waynesheffield.com/wayne/archive/2011/08/running-totals-in-denali-ctp3/

That's the one I refer to, as well. I've recently executed it in SQL Server 2012 with nearly identical results.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jason A. Long SSC-Insane Points: 23731 More actions · Answer 9

This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...



IF OBJECT_ID('tempdb..#temp', 'U') IS NOT NULL

DROP TABLE #temp;

CREATE TABLE #temp (

	Amount DECIMAL(19,0) NOT NULL PRIMARY KEY

	);

;WITH

	n (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)

	),

	Tally (n) AS (

		SELECT

			ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

		FROM n n1, n n2, n n3, n n4, n n5, n n6

	)

INSERT #temp (Amount)

SELECT t.n FROM Tally t ORDER BY t.n;

--=====================================================================

DECLARE

	@b1 DATETIME2(7),

	@b2 DATETIME2(7),

	@b3 DATETIME2(7),

	@e3 DATETIME2(7),

	@V1 DECIMAL(19,0),

	@V2 DECIMAL(19,0)

;

SET @b1 = SYSDATETIME(); --=============================================

SELECT

	@V1 = t.Amount,

	@V2 = SUM(t.Amount) OVER (ORDER BY t.Amount ROWS UNBOUNDED PRECEDING)

FROM

	#temp t;

SET @b2 = SYSDATETIME(); --=============================================

IF OBJECT_ID('tempdb..#RunningTotal', 'U') IS NOT NULL

DROP TABLE #RunningTotal;

SELECT

	t.Amount,

	RT = CAST(NULL AS DECIMAL(19,0))

	INTO

	#RunningTotal

FROM

	#temp t;

CREATE UNIQUE NONCLUSTERED INDEX cix_RunningTotal_Amount ON #RunningTotal (Amount);

------------

SET @b3 = SYSDATETIME();

DECLARE

	@Amt DECIMAL(19,0) = 0,

	@rt DECIMAL(19,0) = 0;

UPDATE rt SET

	@rt = rt.RT = @rt + rt.Amount

FROM

	#RunningTotal rt (TABLOCKX)

OPTION(MAXDOP 1);

SET @e3 = SYSDATETIME();

---------

SELECT

	DATEDIFF(ms, @b1, @b2) AS WindowedFunction,

	DATEDIFF(ms, @b2, SYSDATETIME()) AS QuiryWithTableCreation,

	DATEDIFF(ms, @b3, @e3) AS QuirkyAlone

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 10

Jason A. Long (12/23/2015)

This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...

You've fallen into the same trap that half the world has fallen into when they test CSV splitters.

Look at your data. If you're testing to see which is faster for running totals, ask yourself if your test data actually meets the temporal requirements of what a running total actually is. Then ask yourself why the data you have would favor the Windowed SUM and how it could lead you into making the mistake of selecting the wrong method if performance were uber important. "The Devil's in the Data".

For a hint, look at the test data in Wayne's article and compare it against yours.

Also, you've not followed the rules for the Quirky Update. Do you know which big one you violated?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jason A. Long SSC-Insane Points: 23731 More actions · Answer 11

Jeff Moden (12/23/2015)

Jason A. Long (12/23/2015)

This is just a quick test... But according to this, the Windowed SUM is going faster... see what you think...

You've fallen into the same trap that half the world has fallen into when they test CSV splitters.

Look at your data. If you're testing to see which is faster for running totals, ask yourself if your test data actually meets the temporal requirements of what a running total actually is. Then ask yourself why the data you have would favor the Windowed SUM and how it could lead you into making the mistake of selecting the wrong method if performance were uber important. "The Devil's in the Data".

For a hint, look at the test data in Wayne's article and compare it against yours.

Also, you've not followed the rules for the Quirky Update. Do you know which big one you violated?

The ANCHOR!!! Damn it... I actually had it in there but was getting weird results (forgot the clustered index for about 5 mins), I must have pulled it out when I was trouble shooting and forgot to put it back...

Giving the referenced article a 'quick-ish' look...I do see that all of the tests involve performing an actual update to the dbo.TransactionDetail table... Providing (IMO) an advantage to the quirky update which doesn't work without updating a table...

So... I suppose that brings us into "proposed use" territory... Are we wanting to calculate the running total on the fly with a SELECT query or persist the running totals on the table itself?

If we want to persist the data, the quirky method has the advantage (assuming that the table already has the correct clustered index), but if we simply want to execute a select query, we have to add the cost of selecting the base data into a temp table and creating the clustered index...

With the Windowed SUM, the opposite is true... Persisting the data requires at least 1 index update that it doesn't need to do if it's only doing a select... In either case, one needs to be slowed down to achieve the results of the other.

If I have time over the weekend, I'll do some more testing using your harness... Both "calculated on the fly" and "persisted".

That said, I still have a tough time shrugging off Itzik's warning... Ordered UPDATE and Set-Based Solutions to Running Aggregates

I know some put more stock in the "documented vs undocumented behavior" argument, than others... So I'll just call it a tie breaker... (maybe even a thumb on the scale if it's close)

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 12

It's not just the anchor. You can't use a Non-Clustered Index for the QU because of the possibility of a "Merry-go-Round" index.

Also, your data is sorted on the Amt column... for running totals, there must be a temporal column such as a transaction date/time. The data is further flawed because it contains no duplicate amounts such as real life would have it.

The key to any testing is to make the test data as lifelike as possible. Anyone can come up with a special case where one thing beats another but those special cases aren't what's likely to happen in real life and it's real life that the code will need to handle.

The data you created isn't real life. The data that Wayne tested in his article is and it shows the SUM() Windowing function getting it's head handed back to it by the QU.

Yes, there are applications for SUM() OVER that are real life and will beat the QU, but not for real life running totals.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jason A. Long SSC-Insane Points: 23731 More actions · Answer 13

Jeff Moden (12/24/2015)

It's not just the anchor. You can't use a Non-Clustered Index for the QU because of the possibility of a "Merry-go-Round" index.

Yikes... That's what I get for eating paint chips in the morning... :Whistling:

Jeff Moden (12/24/2015)

Also, your data is sorted on the Amt column... for running totals, there must be a temporal column such as a transaction date/time. The data is further flawed because it contains no duplicate amounts such as real life would have it.

The key to any testing is to make the test data as lifelike as possible. Anyone can come up with a special case where one thing beats another but those special cases aren't what's likely to happen in real life and it's real life that the code will need to handle.

The data you created isn't real life. The data that Wayne tested in his article is and it shows the SUM() Windowing function getting it's head handed back to it by the QU.

Yes, there are applications for SUM() OVER that are real life and will beat the QU, but not for real life running totals.

How about this one?



USE TempDB

GO

     IF OBJECT_ID('TempDB.dbo.TransactionDetail') IS NOT NULL

        DROP TABLE TempDB.dbo.TransactionDetail

GO

 CREATE TABLE dbo.TransactionDetail

        (

        TransactionDetailID INT IDENTITY(1,1), --Temporal "tie-breaker"

        Date                DATETIME,

        AccountID           INT,

        Amount              MONEY,

		-- Removed to the "SELECT ONLY" test

        -- AccountRunningTotal MONEY,  --Running total across each account

        -- AccountRunningCount INT,    --Like "Rank" across each account

        NCID                INT,    --For "proof" later in the article

        CONSTRAINT PK_TransactionDetail_TransactionDetailID

                   PRIMARY KEY NONCLUSTERED (TransactionDetailID)

                   WITH FILLFACTOR = 100

        )

GO

 CREATE CLUSTERED INDEX IXC_Transaction_AccountID_Date_TransactionDetailID

     ON dbo.TransactionDetail (AccountID, Date, TransactionDetailID)

 CREATE NONCLUSTERED INDEX IX_Transaction_NCID

     ON dbo.TransactionDetail (NCID DESC)

GO

    SET NOCOUNT ON

  WHILE (ISNULL(IDENT_CURRENT('TransactionDetail'),0)) < 1000000

  BEGIN

         INSERT INTO dbo.TransactionDetail

                (Date, AccountID, Amount)

         SELECT TOP 10000

                --10 years worth of dates with times from 1/1/2000 to 12/31/2009

                CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME) AS Date,

                --100 different account numbers

                ABS(CHECKSUM(NEWID()))%100+1,

                --Dollar amounts from -99.99 to + 99.99

                CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY)

           FROM Master.dbo.SysColumns sc1

          CROSS JOIN Master.dbo.SysColumns sc2

    END

--===== Update the NCID column to be the reverse of the TransactionDetailID column

 UPDATE dbo.TransactionDetail

    SET NCID = 1000000 - TransactionDetailID + 1

GO

--	SELECT * FROM dbo.TransactionDetail td

--==================================================================================

--==================================================================================

DECLARE

	@b1 DATETIME2(7),

	@b2 DATETIME2(7),

	@b3 DATETIME2(7)

;

SET @b1 = SYSDATETIME();

SELECT

	td.TransactionDetailID,

	td.Date,

	td.AccountID,

	td.Amount,

	AccountRunningTotal = SUM(td.Amount) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),

	AccountRunningCount = COUNT(1) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),

	td.NCID

FROM

	dbo.TransactionDetail td

SET @b2 = SYSDATETIME();

IF OBJECT_ID('tempdb..#RunningTotals', 'U') IS NOT NULL

DROP TABLE #RunningTotals;

SELECT

	td.TransactionDetailID,

	td.[Date],

	td.AccountID,

	td.Amount,

	AccountRunningTotal = CAST(NULL AS MONEY),

	AccountRunningCount = CAST(NULL AS INT),

	td.NCID

	INTO #RunningTotals

FROM

	dbo.TransactionDetail td;

CREATE UNIQUE CLUSTERED INDEX cix_RunningTotals_AccountID_Date_TransactionDetailID

     ON #RunningTotals (AccountID, [Date], TransactionDetailID);

SET @b3 = SYSDATETIME();

DECLARE @PrevAccountID       INT

DECLARE @AccountRunningTotal MONEY

DECLARE @AccountRunningCount INT;

 UPDATE rt

    SET @AccountRunningTotal = rt.AccountRunningTotal = CASE

                                                     WHEN rt.AccountID = @PrevAccountID

                                                     THEN @AccountRunningTotal + rt.Amount

                                                     ELSE rt.Amount

                                                     END,

        @AccountRunningCount = rt.AccountRunningCount = CASE

                                                     WHEN rt.AccountID = @PrevAccountID

                                                     THEN @AccountRunningCount + 1

                                                     ELSE 1

                                                     END,

        @PrevAccountID = rt.AccountID

   FROM #RunningTotals rt WITH (TABLOCKX)

 OPTION (MAXDOP 1);

	SELECT

		rt.TransactionDetailID,

		rt.Date,

		rt.AccountID,

		rt.Amount,

		rt.AccountRunningTotal,

		rt.AccountRunningCount,

		rt.NCID

	FROM

		#RunningTotals rt

	ORDER BY

		rt.AccountID,

		rt.[Date],

		rt.TransactionDetailID;

SELECT

	DATEDIFF(ms, @b1, @b2) AS WindowedSUM,

	DATEDIFF(ms, @b2, SYSDATETIME()) AS QuirkyWTemp,

	DATEDIFF(ms, @b3, SYSDATETIME()) AS QuirkyAlone;

The above allows all rows to render in the display... I tested a version that dumps the values into variables



DECLARE

	@b1 DATETIME2(7),

	@b2 DATETIME2(7),

	@b3 DATETIME2(7),

	@rt MONEY,

	@rc int

;

SET @b1 = SYSDATETIME();

SELECT

	@rt = SUM(td.Amount) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING),

	@rc = COUNT(1) OVER (PARTITION BY td.AccountID ORDER BY td.Date, td.TransactionDetailID ROWS UNBOUNDED PRECEDING)

FROM

	dbo.TransactionDetail td

SET @b2 = SYSDATETIME();

IF OBJECT_ID('tempdb..#RunningTotals', 'U') IS NOT NULL

DROP TABLE #RunningTotals;

SELECT

	td.TransactionDetailID,

	td.[Date],

	td.AccountID,

	td.Amount,

	AccountRunningTotal = CAST(NULL AS MONEY),

	AccountRunningCount = CAST(NULL AS INT),

	td.NCID

	INTO #RunningTotals

FROM

	dbo.TransactionDetail td;

CREATE UNIQUE CLUSTERED INDEX cix_RunningTotals_AccountID_Date_TransactionDetailID

     ON #RunningTotals (AccountID, [Date], TransactionDetailID);

SET @b3 = SYSDATETIME();

DECLARE @PrevAccountID       INT

DECLARE @AccountRunningTotal MONEY

DECLARE @AccountRunningCount INT;

 UPDATE rt

    SET @AccountRunningTotal = rt.AccountRunningTotal = CASE

                                                     WHEN rt.AccountID = @PrevAccountID

                                                     THEN @AccountRunningTotal + rt.Amount

                                                     ELSE rt.Amount

                                                     END,

        @AccountRunningCount = rt.AccountRunningCount = CASE

                                                     WHEN rt.AccountID = @PrevAccountID

                                                     THEN @AccountRunningCount + 1

                                                     ELSE 1

                                                     END,

        @PrevAccountID = rt.AccountID

   FROM #RunningTotals rt WITH (TABLOCKX)

 OPTION (MAXDOP 1);

SELECT

	DATEDIFF(ms, @b1, @b2) AS WindowedSUM,

	DATEDIFF(ms, @b2, SYSDATETIME()) AS QuirkyWTemp,

	DATEDIFF(ms, @b3, SYSDATETIME()) AS QuirkyAlone;

Here are my results... [@@Version = Microsoft SQL Server 2014 - 12.0.4100.1 (X64) Apr 20 2015 17:29:27 Copyright (c) Microsoft Corporation Developer Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)]

-- with display rendering --



WindowedSUM	QuirkyWTemp	QuirkyAlone

4408		5662		5192

4868		5553		5132

4322		5631		5195

4321		5710		5257

4212		5678		5241

-- without display rendering --



WindowedSUM	QuirkyWTemp	QuirkyAlone

842		1358		905

843		1326		889

843		1357		905

843		1294		873

858		1342		889

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 14

Jason A. Long (12/24/2015)

How about this one?

Now you're cooking with gas. MUCH better data and test!... and, I stand corrected.

1. If all you want to do is display the data, SUM() OVER wins.

2. If you want to create a separate table with the running totals, SUM() OVER wins.

3. If you want to do an in-place update on the original table, Quirky Update wins. This one is what Wayne's test did.

I used your code for the 3 various tests above.

For #1, I used your test as is.

For #2, I added an "INTO" to the SUM() OVER code and commented out the final SELECT for the QU.

For #3, I used Wayne's code.

Both Wayne and I skipped testing for #1 and #2... a mistake for both of us. :blush:

Well, done, Jason! I learned something new today! :w00t: You should write an article on this.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions