Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • mark hutchinson (5/9/2011)


    Can people use ngen to compile the assembly and save the JIT overhead?

    In theory, yes; but it's almost certainly not worth it. Without going into too many details on the whole ngen-vs-JIT thing, my view is that for SQL Server it's easier (and probably better in the end) to let JIT do it, even if it might be slower than usual on the first time through a particular code path.

  • Jeff

    The article has been published.

    http://www.experts-exchange.com/A_5410.html

    Enjoy,

    Mark

  • Thanks, Mark. Looks good.

    It's been a thousand years since I've even touched Access. Does it have any type of row numbering function available to it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Does it have any type of row numbering function available to it?

    Yes. MS-Access has a field/column definition data type of autonumber (long integer or GUID/ReplicationID data type). The long integer data type defaults to sequential numbers, but this can be changed to random.

    Such fields are usually used as primary keys, since they aren't reused once a subsequent row has been added to the table. There may be gaps in a sequence due to delete operations.

    Note: It is possible to reuse a sequence if rows are deleted at the end of a table and the database compacted before any new rows are inserted. Compacting never renumbers any of these autonumber values in the middle of a table -- gaps can happen.

    ======

    Unlike TSQL, there is no equivalent ROW_NUMBER() within a set of rows in an Access query. There is an .AbsolutePosition property for most recordsets in VBA code. And, as mentioned in my article, there is no system table from which I can glom a sequence of numbers.

    Mark

  • Would there be a way to use the running total function in Access to make a kind of ROW_NUMBER function?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (5/12/2011)


    Would there be a way to use the running total function in Access to make a kind of ROW_NUMBER function?

    To what running total function are you referring?

  • Hmmmm... I thought that Access had a running total function built in. I might be getting it confused with SSRS.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Just to keep everyone posted, I've submitted an update to the article including the tests on Nadrek's and Peter's functions. Since Peter's was the faster of the two, I've incorporated his code into the attachements of the article along with updated spreadsheeds, graphs, test code and, of course, copies of the function.

    Except for the CLR, I think we've probably got about the fastest cteTally based 8k splitter in the world going on now. Well done everyone! This has been one awesome thread and I'm glad to have had the opportunity to be a part of it.

    For those that can't wait for the updated article... please see the attached...

    Thanks again, folks. This has been a real joy. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (5/12/2011)


    Just to keep everyone posted, I've submitted an update to the article including the tests on Nadrek's and Peter's functions. Since Peter's was the faster of the two, I've incorporated his code into the attachements of the article along with updated spreadsheeds, graphs, test code and, of course, copies of the function.

    Except for the CLR, I think we've probably got about the fastest cteTally based 8k splitter in the world going on now. Well done everyone! This has been one awesome thread and I'm glad to have had the opportunity to be a part of it.

    For those that can't wait for the updated article... please see the attached...

    Thanks again, folks. This has been a real joy. 🙂

    It sure was Jeff, a few days ago I spend another evening trying to improve upon the current best result. Speed wise, I could make nothing faster, my best effort was actually a tiny bit slower yet still faster then all other alternatives! This version did however claim a lot less memory when dealing with smaller input strings. For memory usage I rely on the numbers supplied by the estimate and actual memory claims found in the query plan xml. This could in theory be important when the function is used frequently and concurrent with other memory hungry functions or itself. Not something that would give it an advantage in our current tests.

    Things I tried were a (20 x 20 x 20 = 8000 rows) cross join and also a (90 x 90 = 8100 rows) cross join. The latter also in combination with a calculated top operator to make the crossjoin dynamic ranging from (1 x 90 = 90 rows) upto (90 x 90 = 8100 rows). There is a some constant overhead in this operation in the optimal form and a quite a heavy one in all other forms. it does of course have the benefit of less memory pressure under typical workloads while being maybe less then a percent slower in the single thread benchmark.

    My first attempt to use a second top operator to reduce memory consumption was based on a ceiling( sqrt( datalength ( str ) ) ) kind of construct. This however is one of the worst performers, so I tried ( datalength( str ) + 89 ) / 90 instead as that one does not need explicit casting from float to int and has far simpler calculations.

    But in order to favor this different version over the current best, I really need a multi-threaded test benchmark and also place both solutions in an array of heavy real world uses and measure the effect on aggregated runtimes. Not an easy task and quite prone to measurement fluctuations not caused by a difference in code performance. We need to keep in mind that every bit of added complexity can have consequences in terms of query plan efficiency of the SQL statements that inline these splitter functions. And this currently unproven new version is certainly a bit more complex :(.

    I think it highlights why for heavy use there is no true alternative to the streaming CLR version. Even while I do not currently use CLR myself, it certainly will not suffer from the potentially heavy memory claims and plan complexities.

    One other idea I have been playing with but found no query hints for, was to see the effect of parallelization. Currently the cost estimates for each step in the query plan are far below the point where SQL server will start using multiple threads. In theory the SQL solution does lend itself to parallelization, so in theory it should be able to get more speedups. I may need to install a standalone version of SQL Server and tweak the server settings to cause nearly every plan to get parallelized and see if my suspicion is correct.

  • Hmmmm... I thought that Access had a running total function built in. I might be getting it confused with SSRS.

    Jeff, I believe the running total you are thinking of is in the reports where you can have a "Running Sum" (a property of a text box) over a group or the total report.

    On another note on your splitter function, I was curious why you have

    ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1) in your function. Unless you need to know the order the data was originally in, I would think that it would add unnecessary overhead. While I haven't done any extensive testing, the "Query Cost" (from the Execution plan) seems to be much higher with the ROW_NUMBER() than without it (which I believe doesn't always translate to drastically worse performance).

    I must also say that your new splitter is great (especially with all the improvements made by the others) and I plan on replacing our current function (which is actually your previous splitter). Thanks for making my life easier! 🙂

    Jason

  • JTSash (5/13/2011)


    Hmmmm... I thought that Access had a running total function built in. I might be getting it confused with SSRS.

    Jeff, I believe the running total you are thinking of is in the reports where you can have a "Running Sum" (a property of a text box) over a group or the total report.

    On another note on your splitter function, I was curious why you have

    ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1) in your function. Unless you need to know the order the data was originally in, I would think that it would add unnecessary overhead. While I haven't done any extensive testing, the "Query Cost" (from the Execution plan) seems to be much higher with the ROW_NUMBER() than without it (which I believe doesn't always translate to drastically worse performance).

    I must also say that your new splitter is great (especially with all the improvements made by the others) and I plan on replacing our current function (which is actually your previous splitter). Thanks for making my life easier! 🙂

    Jason

    Excuse me when I am wrong with this as I did not verified this myself.

    This new function does returns one record when the input is NULL. Just from memory, I think the old one did not behave this way. Make sure your existing code works as intended with this behavior.

  • peter-757102 (5/13/2011)


    Excuse me when I am wrong with this as I did not verified this myself.

    This new function does returns one record when the input is NULL. Just from memory, I think the old one did not behave this way. Make sure your existing code works as intended with this behavior.

    Thanks for the heads up. I'll verify this before I implement anything.

    Jason

  • peter-757102 (5/13/2011)


    One other idea I have been playing with but found no query hints for, was to see the effect of parallelization. Currently the cost estimates for each step in the query plan are far below the point where SQL server will start using multiple threads. In theory the SQL solution does lend itself to parallelization, so in theory it should be able to get more speedups. I may need to install a standalone version of SQL Server and tweak the server settings to cause nearly every plan to get parallelized and see if my suspicion is correct.

    I had a play with parallelism last week. There aren't any query hints (sadly) to force a parallel plan, and deficiencies in the current costing model mean that SQL Server will not choose a parallel plan on any of the current T-SQL functions, regardless of the 'cost threshold for parallelism setting (even zero) because the serial plan alternative always costs lower.

    To be clear, the expensive TSQL operations are the string manipulations - SUBSTRING and CHARINDEX - rather than whatever method is used to build the numbers table. These string functions do not have a realistic CPU cost in the model - a longstanding problem with Compute Scalars in general. It is possible to obtain a parallel plan with the CLR function because the optimizer guesses quite a high row count and cost for its unknown contents.

    It is possible to use undocumented DBCC commands to increase the modeled CPU cost and generate parallel plans for the TSQL functions. As you might expect (because the inefficiencies of the TSQL string functions mean performance is constrained on CPU) parallel plans improve the elapsed time markedly - making them as 'fast' as the CLR solution in some cases.

    There are several reasons I didn't submit these results:

    1. Parallelism only benefits the larger tests, and hurts performance on the smaller ones.

    2. You have to be careful to avoid things like the so-called 'performance spool' aka sort + lazy spool combination in the resulting query plan.

    3. while elapsed time improves (and this is the only thing measured by the test rig) the CPU cost rises dramatically, of course.

    4. Undocumented DBCC commands are needed to produce the parallel plan to start with. For use on a production system, we would need to capture the exact XML show plan and use a plan guide or USE PLAN hint.

    5. Speed != efficiency. There is no good reason not to use CLR for this, or the many other scenarios where TSQL sucks, and set-based solutions do not scale.

    There are probably other valid objections too, but you get my drift.

    You can find more about plan costing and the DBCC commands on my blog:

    http://sqlblog.com/blogs/paul_white/archive/2010/09/01/inside-the-optimizer-plan-costing.aspx

    Paul

  • Thank you Paul, you saved me some time right there!

    I am only going to check one more idea (as stubborn as I am) for the specialized and most frequent occurring splitting use. The case is for strings with only a few but variable number of elements. I should know more later this weekend and if it has merit.

  • erroneous post - content deleted

    Tom

Viewing 15 posts - 226 through 240 (of 981 total)

You must be logged in to reply to this topic. Login to reply