Performance Tuning: Concatenation Functions and Some Tuning Myths

  • Jeff Moden (1/1/2008)


    You must have slept through that part of the 431 prep...

    Please tell me they don't actually teach how to use Triangular Joins to derive running totals 😛

    oops - I quoted a bit too much. I was talking about them actually introducing the concept of joins other than equijoins... (i.e. Gila's comment that there was no mention of non-equijoins).

    And no - I haven't yet run into any part talking about computing running totals.....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (1/1/2008)


    You must have slept through that part of the 431 prep...:)

    OK, I will admit I didn't do the 431 training. I did do all of the SQL developer training (for exams 441 and 442), and I don't recall a discussion of joins.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Lovely example of a coding `performance issue` we found just before Christmas...

    A colleague here oversees a system (an external product we use) with a process that was running incredibly slowly. Actually, it's one of a few processes that run slowly with that system 😉

    So I showed him the wonder that is `SQL Server Profiler` 🙂 (which some of our DBAs here seem to know little about!)

    We spotted a strange pattern in the code being run: 400 instances of count(*) being called when we put a single record through the process in question, at about 15ms per count(*). In a live situation this happens across a growing number of records; at 400 x 15ms that's roughly 6 seconds per record, so only 100 records means about 10 minutes of count(*)s 🙂

    So we investigate more, and this same code was running all over the place.

    We contacted the suppliers and talked to a developer.

    ....Yes we use count(*) so our system can check if a table exists with a particular copy of our software or not

    I'm sure you can think of your own series of polite questions you might ask a supplier about the above; suffice to say we had some words with them at this point, and we await their next update with great interest 🙂

    Makes me wonder what other wonderments of code they have hidden away in their system!

    🙂

    martin

  • Thanks for the feedback, Martin. Yeah, I have a grand love for such 3rd-party shenanigans myself... my favorite is loading data from a table into a Temp Table, opening a cursor on the Temp Table, and then inserting one row at a time into another table from the cursor with an explicit transaction and no rollback handler. Ya just gotta wonder where some of those folks learned T-SQL.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Great article, Jeff. I've forwarded the link to my colleagues. We all need to be reminded once in a while that, no matter how hard we try, we don't necessarily write perfect code on the first try. Or, in my case, on the tenth try. :Whistling:

  • Heh... thanks for the feedback, Donald. I'm right there with ya on the 10 tries. 😀

    Like Colin was kind enough to point out, if folks think they can tune this type of "non-tunable" code just by throwing an index on it, they're probably wrong... even a brand new server with a garden hose running to it to keep it cool would still have problems choking down the bad code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Great article.

    What drives me insane about issues like this one (and I have had to create comma-delimited files far too often through SQL) is that you never know what is really going to be the best way of doing it anyway. What I mean is that when I pull this out of the function and wrap it up in a cursor (the dreaded cursor), it runs virtually instantly on the 10,000/25 rows, whereas running it through the function takes over a minute, as you point out. This held true on both SQL 2000 and SQL 2005. Doing the same thing with the 1,000,000/2,500 rows takes about 10 seconds if you simply add an index on SomeID (my machine is 3.0 GHz, 2GB RAM). The cursor is far less readable and clearly not as nice as wrapping it up in a function, but it is faster, and that makes very little sense given that the code is essentially the same.

    Code to wrap in cursor (assumes that table TestData has already been created):

    if object_id('TestDataHold') is not null
        drop table TestDataHold
    go

    create table TestDataHold
    (
        someid    int,
        CSVString char(8000)
    )
    go

    declare @y int
    declare @x varchar(8000)

    declare csr cursor for
        select distinct someid from dbo.TestData order by someid

    open csr
    fetch next from csr into @y

    while @@fetch_status = 0
    begin
        -- Build the CSV string for the current SomeID using the same
        -- concatenation trick the function version uses
        set @x = null
        select @x = ISNULL(@x + ',', '') + cast(somecode as varchar(2))
        from dbo.TestData
        where SomeID = @y
        order by RowNum

        -- Store the finished string for this SomeID
        insert into TestDataHold
        select @y, @x

        fetch next from csr into @y
    end

    close csr
    deallocate csr

    select SomeID, CSVString from TestDataHold

    I suppose if it worked the same way every time, there might be no need for DBAs, eh? 😀

    ETA: This is using the original "bad" code, by the way. I just found it alarming how much better it worked when removed from the function and wrapped in a cursor.
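    For reference, the index mentioned above would presumably be something as simple as the following (the index name here is made up):

    create index IX_TestData_SomeID on dbo.TestData (SomeID)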

  • Jeff,

    What an absolutely well-written and informative article - I totally agree with the earlier post that remarked on this being a great way to start the new year. I seldom have the time or opportunity to visit this site anymore, but I especially made time to read this because I've always associated the author's name with exceptional quality, and it turned out to be time incredibly well spent.

    I've always said that having something to say is as important as how well you say it - your article has both content and style - obviously your communication skills match your performance-tuning skills!:)

    Keep them coming Jeff!

    **ASCII stupid question, get a stupid ANSI !!!**

  • Nice, easy-read article. It even introduced me to a new function (CheckSum), which I looked up when reading your article. I particularly appreciated the code comments. I'm a big believer in documentation. Happy New Year. - JJ

  • Nice article.

    Just another way to do this in sql 2005

    with h(SomeID) as (SELECT DISTINCT SomeID from dbo.TestData)
    select h.SomeID, dbo.fnConcatTest(h.SomeID) from h
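    For anyone reading along without the article open, the function being called is presumably a scalar UDF along these lines (a sketch reconstructed from the concatenation pattern in the cursor example earlier in the thread, not necessarily the article's exact code):

    create function dbo.fnConcatTest_Sketch (@SomeID int)
    returns varchar(8000)
    as
    begin
        declare @csv varchar(8000)

        -- Same concatenation trick as in the cursor version above
        select @csv = isnull(@csv + ',', '') + cast(SomeCode as varchar(2))
        from dbo.TestData
        where SomeID = @SomeID
        order by RowNum

        return @csv
    end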

  • The actual performance tuning principle that this article touches on is the difference between distinct and group by constructs. Lots of beginner SQL programmers use distinct because they don't understand how to use group by. The fact is that while MSSQL is pretty good at figuring out distinct record sets, especially when the right indexes exist, a distinct record set with a computed column requires the column to be computed for every row before the duplicates are removed, while a group by tells the SQL engine up front which rows will be distinct, and thereby limits the computation to one evaluation per distinct ID.

    The underlying theme here is that people often use distinct when group by is more appropriate, because distinct is "easier": it requires less typing and doesn't give an error when you forget to use an aggregate function or forget to add a column to the group by clause. However, there are far fewer occasions when distinct is really appropriate than its distant relative, group by.

    My message would be to challenge any use of distinct in SQL the same way you might challenge a goto in procedural code. And secondly, teach the use of group by to as many SQL programmers as possible!
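    A minimal illustration of the "doesn't give an error" point, using a hypothetical Orders table (not from the article): the group by form complains when a selected column is neither grouped nor aggregated, while distinct silently changes what "distinct" means instead.

    -- Compiles: every selected column is either grouped or aggregated
    SELECT CustomerID, MAX(OrderDate) AS LastOrder
    FROM dbo.Orders
    GROUP BY CustomerID

    -- Adding OrderDate to the select list above would raise a compile error,
    -- whereas the DISTINCT version below accepts it and quietly returns one
    -- row per (CustomerID, OrderDate) combination instead of one per customer
    SELECT DISTINCT CustomerID, OrderDate
    FROM dbo.Orders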

  • Eric: Due to the way I learned SQL, I've never used distinct. But your posting got me thinking. I wonder when the "occasions when distinct is really appropriate" actually are. If it is so much more inefficient, why is it part of the language? Because it is part of the standard?

    I don't know if you know the answers to those questions or not. Now that I'm really thinking about Distinct in ways that I never did before, I'm trying to figure out when it would be a good idea. (You are probably thinking, "Argh!, that's not my point!" I know. Sorry.)

  • Martin Bastable (1/2/2008)


    ....

    We contacted the suppliers and talked to a developer.

    ....Yes we use count(*) so our system can check if a table exists with a particular copy of our software or not ....

    Martin, what did you suggest (if anything)? I would, off-hand, lean towards checking for the table name in sysobjects (in 2000) meself.
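    For what it's worth, a metadata check along those lines (the table name here is made up) never touches the data at all, unlike the count(*) approach:

    -- Works on SQL Server 2000 and later
    if object_id('dbo.SomeVendorTable') is not null
        print 'table exists'

    -- Or the sysobjects route (SQL 2000 style)
    if exists (select 1 from sysobjects where name = 'SomeVendorTable' and xtype = 'U')
        print 'table exists'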

    -----
    [font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]

  • From the standpoint of the SQL engine, distinct is equivalent to a group by that contains all of the columns in the select list. So if you understand the relationship between them, and more importantly you understand how to use group by clauses effectively, then you can decide when it makes sense to you to use a distinct select statement.

    For example, if my group by clause contains all of the columns in my result set, I'm really asking for a distinct list of values and I would normally use a distinct select. This occurs when A) You only have one or two columns in your result set, or B) When you have a badly organized set of data tables that you are reporting against. For example, if you are getting a list of email addresses from a table containing email messages, you could say, select email_address from messages group by email_address, or you could say, select distinct email_address from messages. Which is more clear? To me, either is acceptable. But if I instead have a select statement with 30 columns, and it turns out that one of the source tables occasionally contains duplicate records (and that's not something I can fix) I might use a distinct select statement. I certainly would not do a group by on all 30 columns simply because I believe that obscures the meaning of the statement.
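    Written out, the two equivalent forms described above would be (using the same hypothetical messages table):

    SELECT email_address
    FROM messages
    GROUP BY email_address

    SELECT DISTINCT email_address
    FROM messages

    Both return one row per distinct address; which reads more naturally is largely a matter of taste.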

    The original article hints that

    SELECT DISTINCT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
    FROM dbo.TestData

    doesn't mean the same thing as

    SELECT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
    FROM dbo.TestData
    GROUP BY SomeID

    even if they happen to return the same result set. Although restrictions are placed on UDFs that they should be deterministic (i.e. return the same value every time they are called with the same arguments and source data), and therefore dbo.fnConcatTest(SomeID) should evaluate the same for every repeated value of SomeID, it is actually possible to define a UDF that returns a different value every time it is called, even with the same arguments [create a view that gets a current timestamp and return the value from that view in a UDF]. In that case, asking SQL to return distinct rows is clearly different from asking SQL to group by SomeID and then return a value based on SomeID.
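    The bracketed trick works roughly like this (a sketch with made-up object names; the detour through a view exists because SQL Server 2000 would not allow GETDATE() directly inside a UDF):

    create view dbo.vCurrentTime
    as
        select getdate() as CurrentTime
    go

    create function dbo.fnNotDeterministic ()
    returns datetime
    as
    begin
        -- Same arguments (none) every time, different answer on every call
        return (select CurrentTime from dbo.vCurrentTime)
    end
    go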

    I think that when learning SQL, it is really important to map each SQL expression to an English (or your native language) equivalent so that you learn to read SQL code like natural language. Then it is usually easy to identify the proper solution. For example

    SELECT DISTINCT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
    FROM dbo.TestData

    is like saying, "Using data from the table TestData, for every row get the SomeID and evaluate the function fnConcatTest on SomeID, then return the distinct results." Whereas

    SELECT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
    FROM dbo.TestData
    GROUP BY SomeID

    is like saying, "Using the data from the table TestData, summarize by SomeID, then return the summary of SomeID with the value of the function fnConcatTest for each summary row."

  • Eric: Thanks for taking the time to write out your thoughts!
