Hidden RBAR: Triangular Joins

Question

Post reply

Hidden RBAR: Triangular Joins

Viewing 15 posts - 136 through 150 (of 255 total)

You must be logged in to reply to this topic. Login to reply

sknox SSChampion Points: 12345 More actions · Answer 1

Charles Kincaid (12/9/2007)

I was once quoted the following paradox:

Given set A, which is a set of all sets that do not contain themselves as proper sub sets, is set A in set A?

The mathematician that gave me that was a former coworker. I think that he got his doctorate on that. He said that it was the only true paradox.

Bah, that's no paradox! It has a very easy answer: Set A does contain itself, but only as a highly improper and morally questionable subset. 😛

Back to the subject, this is a good article. It explains to me exactly why a few queries I've written take so long to run, and I look forward to the follow-up to maybe improve those queries.

I'd like to suggest using the term "single-set processing" in place of "set-based programming", though, because (to me, anyway) it more closely fits the ideal being described, and creates an immediate contrast to the multiple sets generated by triangular joins.

sfrazier Newbie Points: 1 More actions · Answer 2

OK I am a simplistic TSQL programmer - I appreciated the article. I am working on some code in a payroll extract project and have run squarely into this problem. Most of our work has been with smaller sets of data (smaller customers). I now find I am looking at 70,000 - 80,000 rows in the extract set and it has performed JUST AS YOU DESCRIBED! Too Funny, I will be working on the problem, and looking forward to the next article with some suggested new ideas. Thanks again.

LeeBear35 SSC Enthusiast Points: 124 More actions · Answer 3

Although I would stay away from cursors on a couple million row table this would be the best I can come up with.

Here is a 3x solution (one pass to build the cursor, one pass to build the table, one pass to display):



DECLARE @OrderID AS INT

DECLARE @Freight AS MONEY

DECLARE @SumFreight AS MONEY

DECLARE @CntFreight AS INT

DECLARE @TableResult TABLE

(

	[OrderID] INT NOT NULL,

	[Freight] MONEY NULL,

	[RunningTotal] MONEY NOT NULL,

	[RunningCount] INT NOT NULL

) 

SET @SumFreight = 0.0

SET @CntFreight = 0

DECLARE [RunningTotals]

CURSOR FORWARD_ONLY

FOR

	SELECT

		[OrderID],

		[Freight]

	FROM [dbo].[Orders]

	ORDER BY

		[OrderID]

OPEN [RunningTotals]

FETCH NEXT FROM [RunningTotals] INTO @OrderID, @Freight

WHILE @@FETCH_STATUS = 0

BEGIN

	SELECT

		@SumFreight = @SumFreight + ISNULL(@Freight, 0.0),

		@CntFreight = @CntFreight + 1

	INSERT @TableResult

	SELECT

		@OrderID,

		@Freight,

		@SumFreight,

		@CntFreight

	FETCH NEXT FROM [RunningTotals] INTO @OrderID, @Freight

END

CLOSE [RunningTotals]

DEALLOCATE [RunningTotals]

SELECT

	*

FROM @TableResult

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 4

TheSQLGuru (1/16/2009)

Jeff:

1) For your follow-on articles, when you demonstrate the 'tsql trick' solution, please make sure you caveat the hell out of it about how and why you can get incorrect results.

2) Because of 1 above, include other solutions (both set-based and cursor/while loop), preferably with benchmarks. Lets get the article to be THE one on the web to point users to for these types of solutions.

3) "Heh... lost leader for one or two articles coming up... " should be "Heh... loss leader for one or two articles coming up... " 😀

Funny you should mention that... I'm in the process of rewriting the bugger because of some of the very caveats and lessons learned mentioned both in the comments that followed the article and because of good prompting by folks like yourself. I've also got a few more examples that show what can go wrong and (lesson learned on my part) why you absolutely cannot rely on an index hint to replace an ORDER BY on SELECTs. I'll also explain that the ORDER BY won't actually show up as a SORT in the execution plan if the optimizer determines that it doesn't actually need to do a SORT. And, make no doubt about it, there's a huge amount of difference between an ORDER BY in the code and the presence (or not) of a SORT in the execution plan.

Thanks for your support and friendly feedback ,especially on that particular article, Kevin.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 5

sfrazier (1/16/2009)

OK I am a simplistic TSQL programmer - I appreciated the article. I am working on some code in a payroll extract project and have run squarely into this problem. Most of our work has been with smaller sets of data (smaller customers). I now find I am looking at 70,000 - 80,000 rows in the extract set and it has performed JUST AS YOU DESCRIBED! Too Funny, I will be working on the problem, and looking forward to the next article with some suggested new ideas. Thanks again.

Please see the following article for a high speed solution to the running total problem. BE ADVISED, please... if you do it exactly right, it will work exacly right. If you leave out even the proverbial "hair of the frog" on that bit of magic, you will come up with incorrect answers. You may actually want to use the solution with "Order By" that I suggest although, as I said, if you use the fastest method correctly, it will work correctly. It'll do 1 million rows in 7 seconds or less.

http://qa.sqlservercentral.com/articles/Advanced+Querying/61716/

Also be advise that there is some erroneous information in that link. Although all of the examples work exactly as advertised, I've found a proof that you absolutely must NOT use an index hint to replace an ORDER BY in a SELECT. It works fine for the trick or "Quirky" update, but never use the method for replacing an ORDER BY in a SELECT.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Dan Guzman - Not the MVP Hall of Fame Points: 3757 More actions · Answer 6

Dan Guzman - Not the MVP

Hall of Fame

Points: 3757

January 16, 2009 at 10:52 am

#928347

I'm confused. The article states it was published today 1/16/2009, yet this discussion was started 12/6/2007! Even the first 9 pages of responses are from 2007. Have we really been waiting over a year for a response? And instead we get a repeat? Huh?

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 7

sknox (1/16/2009)

Back to the subject, this is a good article. It explains to me exactly why a few queries I've written take so long to run, and I look forward to the follow-up to maybe improve those queries.

Thank you for the feedback... if you look carefully at the listing on the homepage, this is actually a reprint from December of 2007... I don't know why they changed the date on the actual article.

Please see the my last post (2 above this one) for the follow-on article along with the caveats I mentioned in that post, as well.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 8

Dan Guzman (1/16/2009)

I'm confused. The article states it was published today 1/16/2009, yet this discussion was started 12/6/2007! Even the first 9 pages of responses are from 2007. Have we really been waiting over a year for a response? And instead we get a repeat? Huh?

The article was actually first published on 12/6/2007 and it actually says that on the SSC home page where the article was listed. Why they decided to change the date on the actual article is anyone guess.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Steve Jones - SSC Editor SSC Guru Points: 728159 More actions · Answer 9

We rerun the best articles at times to ensure that people see them. There's no good way to republish the comments.

There's a bug on the article page that doesn't show the original publication date.

Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

leon mckay Grasshopper Points: 16 More actions · Answer 10

Jeff Moden (1/16/2009)

The article was actually first published on 12/6/2007 and it actually says that on the SSC home page where the article was listed. Why they decided to change the date on the actual article is anyone guess.

You must be tired of answering that question already today. FYI, many people, like myself, will be launching into the article from today's SSC Newsletter and thus miss the clearly noted re-publish date.

Thanks for diligently following up on all the questions/comments and providing the extra info!

lm

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 11

leon mckay (1/16/2009)

Jeff Moden (1/16/2009)

The article was actually first published on 12/6/2007 and it actually says that on the SSC home page where the article was listed. Why they decided to change the date on the actual article is anyone guess.

You must be tired of answering that question already today. FYI, many people, like myself, will be launching into the article from today's SSC Newsletter and thus miss the clearly noted re-publish date.

Thanks for diligently following up on all the questions/comments and providing the extra info!

lm

Thank you, Sir, for the kind feedback. I actually went back to the "release post" (the first one on this thread) and added some information that will, hopefully, keep the confusion level down on this little forum bug.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Caleb-142520 SSC Enthusiast Points: 128 More actions · Answer 12

Caleb-142520

SSC Enthusiast

Points: 128

January 16, 2009 at 12:01 pm

#928410

Thanks Jeff. In your example, I presume everyone would be running around trying to figure out why tempdb was filling up and without 2005's dmvs would find that task somewhat daunting. I appreciate you going into the math about the joins and providing a good example of RBAR that wouldn't seem obvious at first glance.

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 13

magarity kerns (8/15/2008)

By the most amazing coincidence, we have just discovered a developer's query is running out of tempdb space because of the following as his join clause:

where

a.acct_nbr > b1.acct_nbr

and a.acct_nbr <= b2.acct_nbr

and a.acct_nbr = b.acct_nbr

I'm going to forward your article on triangle joins to him after we finishing beating him in the alley out back.

Mararity... heh... every time I see your comment at the end of the post above, I laugh out loud. That's so funny and not unlike the things I've been known to say... especially if you know anything about "pork chops". 😛 You must have a very good group of people in your office to have such good humor.

Because of a date problem when they republish articles like this one, I've found it necessary to update the orginal post on this thread. There, you'll find the link to the running total solution I promised. Please take the time to read my warning that, despite what I say in the article (it's being rewritten), you simply cannot rely on an index hint to return an ordered result for a SELECT. The highspeed update method I demonstrate is still rock solid... many good folks have tried to make it fail and it has not.

Thanks again for the great laugh.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 14

Caleb (1/16/2009)

Thanks Jeff. In your example, I presume everyone would be running around trying to figure out why tempdb was filling up and without 2005's dmvs would find that task somewhat daunting. I appreciate you going into the math about the joins and providing a good example of RBAR that wouldn't seem obvious at first glance.

Thanks, Caleb. Just remember a couple of things...

First, a seemingly innocent and proper looking join can also make a kind of "triangular" join if there's a 1 to many relationship. In fact it can get much worse than a simple trigangular join.

The other thing to remember is that not everything that looks like a triangular join, is. Even if it does turn out to be triangular, not all triangular joins are bad. For example, the classic problem of finding missing numbers in a sequence of numbers is most quickly solved by what looks like a horrid triangular join built into a correlated sub-query... but in truth, nothing I've seen can beat it... here's the example complete with some data generation code... there's actually two examples of finding gaps near the end of the code...

DROP TABLE #MyTest



GO

--===== Create and populate a 1,000,000 row test table.

     -- Column "MyID" has a range of 1 to 1,000,000 unique numbers starting at 82011000000001

     -- Jeff Moden

 SELECT TOP 1000000

        MyID       = IDENTITY(BIGINT,82011000000001,1)

   INTO #MyTest

   FROM Master.dbo.SysColumns sc1,

        Master.dbo.SysColumns sc2

--===== A table is not properly formed unless a Primary Key has been assigned

     -- Takes about 3 seconds to execute.

  ALTER TABLE #MyTest

        ADD PRIMARY KEY CLUSTERED (MyID)

--===== Now, let's add a calculated column to display leading zeros in a "field" of 15 characters

     -- like the original problem.

  ALTER TABLE #MyTest

    ADD DisplayMyID AS RIGHT('000000000000000'+CAST(MyID AS VARCHAR(15)),15)

--===== Delete some know rows to demo the gap detection code

     -- This deletes 50 rows spaced 2000 apart in the given range

     -- to demo small gaps

 DELETE #MyTest

  WHERE MyID BETWEEN 82011000500001 AND 82011000600000

    AND MyID %2000 = 0

     -- This deletes 100,000 rows in a given range to demo large gaps

 DELETE #MyTest

  WHERE MyID BETWEEN 82011000600001 AND 82011000700000

--===== Find the "gap ranges"

     -- Finds leading edge of "islands" and then computes the gaps

     -- This assumes that gaps include any whole number greater than 0

 SELECT GapStart = (SELECT ISNULL(MAX(b.MyID),0)+1

                      FROM #MyTest b

                     WHERE b.MyID < a.MyID),

        GapEnd = MyID - 1

   FROM #MyTest a

  WHERE a.MyID NOT IN (SELECT MyID + 1 FROM #MyTest)

--===== Find the "gap ranges"

     -- Finds leading edge of "islands" and then computes the gaps

     -- This assumes that gaps include only from the first number on

     -- and does so without slowing the code down

 SELECT *

   FROM (--==== Find all the gaps like before

         SELECT GapStart = (SELECT ISNULL(MAX(b.MyID),0)+1

                              FROM #MyTest b

                             WHERE b.MyID < a.MyID),

                GapEnd = MyID - 1

           FROM #MyTest a

          WHERE a.MyID NOT IN (SELECT MyID + 1 FROM #MyTest)

        )g

  WHERE g.GapStart > 1

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 15

Jeff, it strikes me that pretty much all of your examples here and in other threads have a clustered index on the key of concern. What happens when it is a non-clustered index? Or worse, no index? Often times people don't have the luxury of putting indexes on a column, especially clustered one's. Maybe it is time for some benchmarking on 'real world' tables. I think other mechanisms (including cursors and while loops and non-breaking set logic) may come to the fore in such cases as being more efficient.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service