Hidden RBAR: Triangular Joins

  • Charles Kincaid (12/9/2007)


    I was once quoted the following paradox:

    Given set A, which is a set of all sets that do not contain themselves as proper sub sets, is set A in set A?

    The mathematician that gave me that was a former coworker. I think that he got his doctorate on that. He said that it was the only true paradox.

    Bah, that's no paradox! It has a very easy answer: Set A does contain itself, but only as a highly improper and morally questionable subset. 😛

    Back to the subject, this is a good article. It explains to me exactly why a few queries I've written take so long to run, and I look forward to the follow-up to maybe improve those queries.

    I'd like to suggest using the term "single-set processing" in place of "set-based programming", though, because (to me, anyway) it more closely fits the ideal being described, and creates an immediate contrast to the multiple sets generated by triangular joins.

  • OK I am a simplistic TSQL programmer - I appreciated the article. I am working on some code in a payroll extract project and have run squarely into this problem. Most of our work has been with smaller sets of data (smaller customers). I now find I am looking at 70,000 - 80,000 rows in the extract set and it has performed JUST AS YOU DESCRIBED! Too Funny, I will be working on the problem, and looking forward to the next article with some suggested new ideas. Thanks again.

  • Although I would stay away from cursors on a couple million row table this would be the best I can come up with.

    Here is a 3x solution (one pass to build the cursor, one pass to build the table, one pass to display):

    DECLARE @OrderID AS INT

    DECLARE @Freight AS MONEY

    DECLARE @SumFreight AS MONEY

    DECLARE @CntFreight AS INT

    DECLARE @TableResult TABLE

    (

    [OrderID] INT NOT NULL,

    [Freight] MONEY NULL,

    [RunningTotal] MONEY NOT NULL,

    [RunningCount] INT NOT NULL

    )

    SET @SumFreight = 0.0

    SET @CntFreight = 0

    DECLARE [RunningTotals]

    CURSOR FORWARD_ONLY

    FOR

    SELECT

    [OrderID],

    [Freight]

    FROM [dbo].[Orders]

    ORDER BY

    [OrderID]

    OPEN [RunningTotals]

    FETCH NEXT FROM [RunningTotals] INTO @OrderID, @Freight

    WHILE @@FETCH_STATUS = 0

    BEGIN

    SELECT

    @SumFreight = @SumFreight + ISNULL(@Freight, 0.0),

    @CntFreight = @CntFreight + 1

    INSERT @TableResult

    SELECT

    @OrderID,

    @Freight,

    @SumFreight,

    @CntFreight

    FETCH NEXT FROM [RunningTotals] INTO @OrderID, @Freight

    END

    CLOSE [RunningTotals]

    DEALLOCATE [RunningTotals]

    SELECT

    *

    FROM @TableResult

  • TheSQLGuru (1/16/2009)


    Jeff:

    1) For your follow-on articles, when you demonstrate the 'tsql trick' solution, please make sure you caveat the hell out of it about how and why you can get incorrect results.

    2) Because of 1 above, include other solutions (both set-based and cursor/while loop), preferably with benchmarks. Lets get the article to be THE one on the web to point users to for these types of solutions.

    3) "Heh... lost leader for one or two articles coming up... " should be "Heh... loss leader for one or two articles coming up... " 😀

    Funny you should mention that... I'm in the process of rewriting the bugger because of some of the very caveats and lessons learned mentioned both in the comments that followed the article and because of good prompting by folks like yourself. I've also got a few more examples that show what can go wrong and (lesson learned on my part) why you absolutely cannot rely on an index hint to replace an ORDER BY on SELECTs. I'll also explain that the ORDER BY won't actually show up as a SORT in the execution plan if the optimizer determines that it doesn't actually need to do a SORT. And, make no doubt about it, there's a huge amount of difference between an ORDER BY in the code and the presence (or not) of a SORT in the execution plan.

    Thanks for your support and friendly feedback ,especially on that particular article, Kevin.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • sfrazier (1/16/2009)


    OK I am a simplistic TSQL programmer - I appreciated the article. I am working on some code in a payroll extract project and have run squarely into this problem. Most of our work has been with smaller sets of data (smaller customers). I now find I am looking at 70,000 - 80,000 rows in the extract set and it has performed JUST AS YOU DESCRIBED! Too Funny, I will be working on the problem, and looking forward to the next article with some suggested new ideas. Thanks again.

    Please see the following article for a high speed solution to the running total problem. BE ADVISED, please... if you do it exactly right, it will work exacly right. If you leave out even the proverbial "hair of the frog" on that bit of magic, you will come up with incorrect answers. You may actually want to use the solution with "Order By" that I suggest although, as I said, if you use the fastest method correctly, it will work correctly. It'll do 1 million rows in 7 seconds or less.

    http://qa.sqlservercentral.com/articles/Advanced+Querying/61716/

    Also be advise that there is some erroneous information in that link. Although all of the examples work exactly as advertised, I've found a proof that you absolutely must NOT use an index hint to replace an ORDER BY in a SELECT. It works fine for the trick or "Quirky" update, but never use the method for replacing an ORDER BY in a SELECT.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I'm confused. The article states it was published today 1/16/2009, yet this discussion was started 12/6/2007! Even the first 9 pages of responses are from 2007. Have we really been waiting over a year for a response? And instead we get a repeat? Huh?

  • sknox (1/16/2009)


    Back to the subject, this is a good article. It explains to me exactly why a few queries I've written take so long to run, and I look forward to the follow-up to maybe improve those queries.

    Thank you for the feedback... if you look carefully at the listing on the homepage, this is actually a reprint from December of 2007... I don't know why they changed the date on the actual article.

    Please see the my last post (2 above this one) for the follow-on article along with the caveats I mentioned in that post, as well.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Dan Guzman (1/16/2009)


    I'm confused. The article states it was published today 1/16/2009, yet this discussion was started 12/6/2007! Even the first 9 pages of responses are from 2007. Have we really been waiting over a year for a response? And instead we get a repeat? Huh?

    The article was actually first published on 12/6/2007 and it actually says that on the SSC home page where the article was listed. Why they decided to change the date on the actual article is anyone guess.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • We rerun the best articles at times to ensure that people see them. There's no good way to republish the comments.

    There's a bug on the article page that doesn't show the original publication date.

  • Jeff Moden (1/16/2009)


    The article was actually first published on 12/6/2007 and it actually says that on the SSC home page where the article was listed. Why they decided to change the date on the actual article is anyone guess.

    You must be tired of answering that question already today. FYI, many people, like myself, will be launching into the article from today's SSC Newsletter and thus miss the clearly noted re-publish date.

    Thanks for diligently following up on all the questions/comments and providing the extra info!

    lm

  • leon mckay (1/16/2009)


    Jeff Moden (1/16/2009)


    The article was actually first published on 12/6/2007 and it actually says that on the SSC home page where the article was listed. Why they decided to change the date on the actual article is anyone guess.

    You must be tired of answering that question already today. FYI, many people, like myself, will be launching into the article from today's SSC Newsletter and thus miss the clearly noted re-publish date.

    Thanks for diligently following up on all the questions/comments and providing the extra info!

    lm

    Thank you, Sir, for the kind feedback. I actually went back to the "release post" (the first one on this thread) and added some information that will, hopefully, keep the confusion level down on this little forum bug.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks Jeff. In your example, I presume everyone would be running around trying to figure out why tempdb was filling up and without 2005's dmvs would find that task somewhat daunting. I appreciate you going into the math about the joins and providing a good example of RBAR that wouldn't seem obvious at first glance.

  • magarity kerns (8/15/2008)


    By the most amazing coincidence, we have just discovered a developer's query is running out of tempdb space because of the following as his join clause:

    where

    a.acct_nbr > b1.acct_nbr

    and a.acct_nbr <= b2.acct_nbr

    and a.acct_nbr = b.acct_nbr

    I'm going to forward your article on triangle joins to him after we finishing beating him in the alley out back.

    Mararity... heh... every time I see your comment at the end of the post above, I laugh out loud. That's so funny and not unlike the things I've been known to say... especially if you know anything about "pork chops". 😛 You must have a very good group of people in your office to have such good humor.

    Because of a date problem when they republish articles like this one, I've found it necessary to update the orginal post on this thread. There, you'll find the link to the running total solution I promised. Please take the time to read my warning that, despite what I say in the article (it's being rewritten), you simply cannot rely on an index hint to return an ordered result for a SELECT. The highspeed update method I demonstrate is still rock solid... many good folks have tried to make it fail and it has not.

    Thanks again for the great laugh.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Caleb (1/16/2009)


    Thanks Jeff. In your example, I presume everyone would be running around trying to figure out why tempdb was filling up and without 2005's dmvs would find that task somewhat daunting. I appreciate you going into the math about the joins and providing a good example of RBAR that wouldn't seem obvious at first glance.

    Thanks, Caleb. Just remember a couple of things...

    First, a seemingly innocent and proper looking join can also make a kind of "triangular" join if there's a 1 to many relationship. In fact it can get much worse than a simple trigangular join.

    The other thing to remember is that not everything that looks like a triangular join, is. Even if it does turn out to be triangular, not all triangular joins are bad. For example, the classic problem of finding missing numbers in a sequence of numbers is most quickly solved by what looks like a horrid triangular join built into a correlated sub-query... but in truth, nothing I've seen can beat it... here's the example complete with some data generation code... there's actually two examples of finding gaps near the end of the code...

    DROP TABLE #MyTest

    GO

    --===== Create and populate a 1,000,000 row test table.

    -- Column "MyID" has a range of 1 to 1,000,000 unique numbers starting at 82011000000001

    -- Jeff Moden

    SELECT TOP 1000000

    MyID = IDENTITY(BIGINT,82011000000001,1)

    INTO #MyTest

    FROM Master.dbo.SysColumns sc1,

    Master.dbo.SysColumns sc2

    --===== A table is not properly formed unless a Primary Key has been assigned

    -- Takes about 3 seconds to execute.

    ALTER TABLE #MyTest

    ADD PRIMARY KEY CLUSTERED (MyID)

    --===== Now, let's add a calculated column to display leading zeros in a "field" of 15 characters

    -- like the original problem.

    ALTER TABLE #MyTest

    ADD DisplayMyID AS RIGHT('000000000000000'+CAST(MyID AS VARCHAR(15)),15)

    --===== Delete some know rows to demo the gap detection code

    -- This deletes 50 rows spaced 2000 apart in the given range

    -- to demo small gaps

    DELETE #MyTest

    WHERE MyID BETWEEN 82011000500001 AND 82011000600000

    AND MyID %2000 = 0

    -- This deletes 100,000 rows in a given range to demo large gaps

    DELETE #MyTest

    WHERE MyID BETWEEN 82011000600001 AND 82011000700000

    --===== Find the "gap ranges"

    -- Finds leading edge of "islands" and then computes the gaps

    -- This assumes that gaps include any whole number greater than 0

    SELECT GapStart = (SELECT ISNULL(MAX(b.MyID),0)+1

    FROM #MyTest b

    WHERE b.MyID < a.MyID),

    GapEnd = MyID - 1

    FROM #MyTest a

    WHERE a.MyID NOT IN (SELECT MyID + 1 FROM #MyTest)

    --===== Find the "gap ranges"

    -- Finds leading edge of "islands" and then computes the gaps

    -- This assumes that gaps include only from the first number on

    -- and does so without slowing the code down

    SELECT *

    FROM (--==== Find all the gaps like before

    SELECT GapStart = (SELECT ISNULL(MAX(b.MyID),0)+1

    FROM #MyTest b

    WHERE b.MyID < a.MyID),

    GapEnd = MyID - 1

    FROM #MyTest a

    WHERE a.MyID NOT IN (SELECT MyID + 1 FROM #MyTest)

    )g

    WHERE g.GapStart > 1

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff, it strikes me that pretty much all of your examples here and in other threads have a clustered index on the key of concern. What happens when it is a non-clustered index? Or worse, no index? Often times people don't have the luxury of putting indexes on a column, especially clustered one's. Maybe it is time for some benchmarking on 'real world' tables. I think other mechanisms (including cursors and while loops and non-breaking set logic) may come to the fore in such cases as being more efficient.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 15 posts - 136 through 150 (of 255 total)

You must be logged in to reply to this topic. Login to reply