Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Sean Lange (9/4/2014)


    Solomon Rutzky (9/4/2014)


    Eirikur, I completely agree with the statements of "just think how much is going to fail down the line if suddenly NULLs become empty strings" and "if the value is not there it is truly unknown and ... that is a NULL, only NULL and nothing ... but NULL". This is why I agree with Jeff in disagreeing with the idea that the split function could, under some other interpretation, return NULL for any element. A NULL, as you stated, means "not there", hence could never have been part of the set that was joined together into a single, delimited string in the first place. If a set contains:

    Bob

    NULL

    Sally

    then the result of that set joined (as in String.Join(",")) together would be:

    Bob,Sally

    On the other hand, the following set:

    Bob

    {empty string}

    Sally

    would result in:

    Bob,,Sally

    Splitting either of those results on the comma would give you back the original set, with the exception, of course, of the missing element in the first set: the known unknown. If the reduced number of elements (due to NULLs not being represented in the CSV list) presents a problem for the operation, then either:

  • send separately the original number of elements (the diff between that and the result from the Split() would be the number of NULL elements)
  • don't allow the source string fields to be NULL
  • do an ISNULL / COALESCE such that NULLs are converted to an empty string or some other reserved designater that the down-stream code knows how to interpret
  • Basically, in terms of SQL Server (and not Oracle ;-)), splitting a string can only return a NULL if the input itself is NULL.

    Take care,

    Solomon..

    According to your logic what would a splitter return if the input string had a trailing comma?

    Bob,Sally,

    By no means am I saying I disagree with your sentiments just wondering. Is this a 2 element list or 3? Is the NULL there or is it not? Is the third element an empty string, a NULL or not returned?

    Hi Sean. In the case of:

    Bob,Sally,

    The result should be 3 records:

    Bob

    Sally

    {empty string}

    A NULL cannot be there as it was a non-NULL string that was split. And there are two delimiters so there are necessarily three elements to split. I find that this is the common interpretation as most of the set-to-single-string concatenations I have seen (usually using FOR XML PATH('')) strip off the trailing delimiter.

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Solomon Rutzky (9/4/2014)

    Hi Sean. In the case of:

    Bob,Sally,

    The result should be 3 records:

    Bob

    Sally

    {empty string}

    A NULL cannot be there as it was a non-NULL string that was split. And there are two delimiters so there are necessarily three elements to split. I find that this is the common interpretation as most of the set-to-single-string concatenations I have seen (usually using FOR XML PATH('')) strip off the trailing delimiter.

    Take care,

    Solomon...

    Oh good...I was getting confused with your explanation. That is how I would expect it too.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Solomon Rutzky (9/4/2014)


    Sean Lange (9/4/2014)


    Solomon Rutzky (9/4/2014)


    Eirikur, I completely agree with the statements of "just think how much is going to fail down the line if suddenly NULLs become empty strings" and "if the value is not there it is truly unknown and ... that is a NULL, only NULL and nothing ... but NULL". This is why I agree with Jeff in disagreeing with the idea that the split function could, under some other interpretation, return NULL for any element. A NULL, as you stated, means "not there", hence could never have been part of the set that was joined together into a single, delimited string in the first place. If a set contains:

    Bob

    NULL

    Sally

    then the result of that set joined (as in String.Join(",")) together would be:

    Bob,Sally

    On the other hand, the following set:

    Bob

    {empty string}

    Sally

    would result in:

    Bob,,Sally

    Splitting either of those results on the comma would give you back the original set, with the exception, of course, of the missing element in the first set: the known unknown. If the reduced number of elements (due to NULLs not being represented in the CSV list) presents a problem for the operation, then either:

  • send separately the original number of elements (the diff between that and the result from the Split() would be the number of NULL elements)
  • don't allow the source string fields to be NULL
  • do an ISNULL / COALESCE such that NULLs are converted to an empty string or some other reserved designater that the down-stream code knows how to interpret
  • Basically, in terms of SQL Server (and not Oracle ;-)), splitting a string can only return a NULL if the input itself is NULL.

    Take care,

    Solomon..

    According to your logic what would a splitter return if the input string had a trailing comma?

    Bob,Sally,

    By no means am I saying I disagree with your sentiments just wondering. Is this a 2 element list or 3? Is the NULL there or is it not? Is the third element an empty string, a NULL or not returned?

    Hi Sean. In the case of:

    Bob,Sally,

    The result should be 3 records:

    Bob

    Sally

    {empty string}

    A NULL cannot be there as it was a non-NULL string that was split. And there are two delimiters so there are necessarily three elements to split. I find that this is the common interpretation as most of the set-to-single-string concatenations I have seen (usually using FOR XML PATH('')) strip off the trailing delimiter.

    Take care,

    Solomon...

    Colour me slow but somehow I am still struggling, so the set of MyFamily{Bob,(nothing),Sally} which we present in tabular format as (for the sake of argument) {Me,Brother,Sister}, when interrogated for content would then according to the previous logic, result in a swift and painless gender swap:

    SELECT * FROM MyFamily

    Me,Brother,Sister

    Bob,Sally

    Someone is not going to like this:angry:

    A missing value cannot alter the structure of a set, missing or not the element should and must still be there. The ambiguity implied by not being able to determine the difference of an empty value and a missing value is throwing things off target.

    😎

  • Eirikur Eiriksson (9/4/2014)


    Solomon Rutzky (9/4/2014)


    Hi Sean. In the case of:

    Bob,Sally,

    The result should be 3 records:

    Bob

    Sally

    {empty string}

    A NULL cannot be there as it was a non-NULL string that was split. And there are two delimiters so there are necessarily three elements to split. I find that this is the common interpretation as most of the set-to-single-string concatenations I have seen (usually using FOR XML PATH('')) strip off the trailing delimiter.

    Take care,

    Solomon...

    Colour me slow but somehow I am still struggling, so the set of MyFamily{Bob,(nothing),Sally} which we present in tabular format as (for the sake of argument) {Me,Brother,Sister}, when interrogated for content would then according to the previous logic, result in a swift and painless gender swap:

    SELECT * FROM MyFamily

    Me,Brother,Sister

    Bob,Sally

    Someone is not going to like this:angry:

    A missing value cannot alter the structure of a set, missing or not the element should and must still be there. The ambiguity implied by not being able to determine the difference of an empty value and a missing value is throwing things off target.

    😎

    Not slow at all. It is just an inherent problem of trying to represent "nothing" within something. This problem exists in XML so they created the xsi:nil="True" for Element-based XML (attribute-based XML simply uses the absence of the attribute to indicate a NULL). The problem exists for SQL Server in trying to persist the data in .mdf files which is why it takes an extra byte in the row header for NULLable (i.e. lacking in the NOT NULL specification) fields as there needs to be a place to indicate that the field is NULL since the NULL cannot be represented in the field itself. A simple CSV list of values is too simplistic of a format to handle this nuance. That is why I gave 3 alternatives if it was more important to keep the total number of elements than it was to distinguish between "truly unknown" and "known to be nothing". In your example data, it would have made more sense to use empty string as the placeholder for "Brother", at least when represented as text in a CSV list (even if the underlying data is NULL). So in your case you should take the ISNULL / COALESCE route as an empty string can be assumed to be NULL when reconstituting the CSV back into a set.

    But what about a form asking for someone's name? Maybe they fill our FirstName and LastName. If MiddleName IS NULL, that is not necessarily the same as an empty string. NULL would mean "has not yet answered" / "doesn't know" as opposed to empty string meaning "has answered and the answer is that this person has no middle name". That scenario cannot be easily represented in a simplistic CSV list. Which is fine because that is not what a simple CSV list is meant to do. Although if that is all someone has to work with, then the other option I mentioned of using a reserved string to indicate NULL would be the only way to allow for both empty string and NULL (e.g. something that would never be a valid middle name, such as '<<<NULL>>>'). But again, this is a benefit of using a more structured format such as XML that can handle this nuance, even if it is bloated.

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • dwain.c (9/3/2014)


    Jeff Moden (9/3/2014)


    I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.

    I assume you meant something like this?

    SELECT ItemNumber, Item, NULLIF(Item, '')

    FROM dbo.DelimitedSplit8K(' ', ' ');

    Z'actly.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I know I'm a little late to the discussion here, but I think it is so cool that a function posted years ago is still generating discussion and debate. There are those that say string splitting isn't necessary in today's world, but this discussion is proof that it certainly is. I know I use it extensively and recommend it to others.

    On this topic of what to return, I think that {Bob,,Sally} should return 3 rows, the second one being empty.

    I also believe that {Bob,Sally,} should return 3 rows, the last one being empty.

    If the developer chooses to eliminate the empty string from the result set, that's their own error. Errors are easy to make - ask any developer.

    And finally, even through I worked with Oracle for a number of years, a NULL and an empty string are fundamentally different and should never be equal.

  • Ed Wagner (9/5/2014)


    And finally, even through I worked with Oracle for a number of years, a NULL and an empty string are fundamentally different and should never be equal.

    Nor unequal, for that matter. Which is wherey a lot of people get confused.

    Tom

  • That took me by surprise the first time I worked with Oracle, as well. NULL same as '' not same as ' '.

    The other thing that took me by total surprise is that comments aren't totally ignored. If you put an "&" in the comments of a proc and execute it, it comes back and either asks you for a parameter or tells you one is missing.

    Of course, SQL Server is not without sin. It says right in BOL that the ORDER BY in a SELECT INTO will not guarantee the order of the IDENTITY(INT,1,1) function that you might also use but, guess what? There's no other place that function will actually work! Yes, you can use ROW_NUMBER() but WTH????

    Heh... MS should stop inventing new things, stop deprecating old things, and start making things work / adding practical things. For example, the CONNECT item to create a machine language level Tally Function or some other similar built-in has been open for 7 years (opened on 2/18/2007) and has had no action. At least they haven't closed it as "built to design" or "won't fix". It would also be nice if they provided a built in text splitter that would work the same for 1 or multiple character delimiters and NVARCHAR/VARCHAR of any size those two datatypes can handle without having to specify. It can't be that bloody difficult... tons of people have written short CLRs to do the same thing. Grab one of those and call it a day!!!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (9/5/2014)


    That took me by surprise the first time I worked with Oracle, as well. NULL same as '' not same as ' '.

    The other thing that took me by total surprise is that comments aren't totally ignored. If you put an "&" in the comments of a proc and execute it, it comes back and either asks you for a parameter or tells you one is missing.

    Of course, SQL Server is not without sin. It says right in BOL that the ORDER BY in a SELECT INTO will not guarantee the order of the IDENTITY(INT,1,1) function that you might also use but, guess what? There's no other place that function will actually work! Yes, you can use ROW_NUMBER() but WTH????

    Heh... MS should stop inventing new things, stop deprecating old things, and start making things work / adding practical things. For example, the CONNECT item to create a machine language level Tally Function or some other similar built-in has been open for 7 years (opened on 2/18/2007) and has had no action. At least they haven't closed it as "built to design" or "won't fix". It would also be nice if they provided a built in text splitter that would work the same for 1 or multiple character delimiters and NVARCHAR/VARCHAR of any size those two datatypes can handle without having to specify. It can't be that bloody difficult... tons of people have written short CLRs to do the same thing. Grab one of those and call it a day!!!

    My guess is that DelimitedSplit8K caught their attention and they haven't been able to figure out a way to write something faster.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Microsoft seems totally marketing driven these last years, everything else like accessibility and usability be damned. I find it really sad, it could be miles ahead of everything very easily if they had the right people making the calls.

  • To be fair, Microsoft did give us master data services which has a CLR udf splitter (an awful one) that still requires enabling CLR on so most DBAs just say no.

    As for varchar2 in oracle land, it actually is one of my favorite jokes about bad programming design.

    Let's say you write a string splitter called split but it doesn't work for certain delimiters like '&', because you used the xml datatype trick.

    Instead of fixing the issue, you should follow the oracle route.

    1. rename it to split2. Mark it as deprecated.

    2. create an alias for split2 with the name split

    3. Put out documentation saying that split2 doesn't work for a semicolon delimiter.

    4. never bother to create split, claiming that to do so would be considerable effort and would break backward compatibility.

    I'm hoping that microsoft gives us a tvf for string splitting but I'm not super hopeful. It would be the only freestanding tvf they would give us so it would be a new entity. I would also expect the ability to rename it like I can with xml datatypes. I'd also like a proper way to concat strings, not using xml path(''). I can't even write my own aggregate as there is no guarantee about order. (Microsoft is even worse then oracle, they create the attribute and the only way to find out it doesn't work is to read the docs!)

  • To be fair, Microsoft did give us master data services which has a CLR udf splitter (an awful one) that still requires enabling CLR on so most DBAs just say no.

    With MDS/DQS there are actually two CLR splitters: mdq.split and mdq.regexsplit. They are very similar and have a lot of functionality.

    All that said, I use delimitedsplit8K because it's faster (I actually tested it and posted the results earlier in this this thread) and does not require CLR integration. Ditto PatternSplitCM which I have found can be used to accomplish many tasks that people would normally turn to Regex for.

    Edit: used the wrong IFcode shortcut;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Jeff Moden (9/5/2014)


    Heh... MS should stop inventing new things, stop deprecating old things, and start making things work / adding practical things. For example, the CONNECT item to create a machine language level Tally Function or some other similar built-in has been open for 7 years (opened on 2/18/2007) and has had no action. At least they haven't closed it as "built to design" or "won't fix". It would also be nice if they provided a built in text splitter that would work the same for 1 or multiple character delimiters and NVARCHAR/VARCHAR of any size those two datatypes can handle without having to specify. It can't be that bloody difficult... tons of people have written short CLRs to do the same thing. Grab one of those and call it a day!!!

    I kinda sorta agree, but for the most part I am figuring that their reason for not adding these is the same reason why you think that they should, namely: it's not bloody difficult and tons of people have already written more-than-adequate versions in CLR. I am sure that the SQL Server team at Microsoft has a lengthy backlog of quick / simplistic functions to add where any one of them by themselves could be added and tested in less than a day. But how many do they add? Just the two that you mentioned? I am sure some other people think that some other function might be higher priority (though it certainly seems like splitting strings is a near-universal need). Adding Regular Expression functions should also be relatively easy, especially if they do what some other RDBMS's do and give just the most basic form of the function.

    In the end, isn't this why Microsoft gave us CLR Integration? This way we don't need to inundate them with all of these requests. I figure that Microsoft would rather spend their time on growing the data platform (SSIS, SSRS, SSAS, StreamInsight, ColumnStore Indexes, In-Memory OLTP / Natively Compiled Stored Procs, Hadoop integration, etc.) instead of adding lots of functions that will also take up time in terms of bug fixes, backwards compatibility in future versions, etc. Via CLR Integration, we can add hundreds / thousands of functions that they would never have time to do. There are a few drawbacks, of course, such as limitations in the SQL Server to CLR API that do not exist for their developers when working with the SQL Server source code: things like a slight performance hit to pass data through the API, the ability to overload functions so that different numbers of parameters can be used, etc. But for the most part it is a very flexible and functional framework.

    Regarding the two items you mentioned, Tally function and splitter, both are available for free in the SQL#[/url] library, along with quite a few others. There are several versions of the "tally" functionality: Util_GenerateDateTimeRange, Util_GenerateDateTimes, Util_GenerateFloatRange, Util_GenerateFloats, Util_GenerateIntRange, and Util_GenerateInts. There is also: String_Split and RegEx_Split. All of these (and all String, RegEx, and Util functions) work with the Assembly set to SAFE so there is no "good" reason to not use them, or write your own, outside of working with Azure SQL Database which does not support CLR Integration at this time.

    Take care,

    Solomon..

    EDIT:

    As far as I can tell, Microsoft is already showing reluctance to add new functions in their VC++ code (I assume it is C++) and are themselves relying on CLR Integration. Just run the following TRY_PARSE (which started in SQL Server 2012) in either SQL 2012 or SQL 2014:

    SELECT TRY_PARSE('2079-06-06 23:59:30' AS SMALLDATETIME)

    and you will get the following error:

    Msg 6521, Level 16, State 1, Line 2

    A .NET Framework error occurred during statement execution:

    System.InvalidCastException: Specified cast is not valid.

    System.InvalidCastException:

    at System.Data.SqlServer.Internal.CXVariantBase.SqlDateTimeToSmallDate(SqlDateTime sdt)

    at System.Data.SqlServer.Internal.CXVariantBase.DateTimeToSmallDate(DateTime dt)

    which does have an "active" CONNECT item.

    And the FORMAT command is also most likely straight from .Net; it has the same signature and behavior as the C# DateTime.ToString(Format, CultureInfo).

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • dwain.c (9/7/2014)


    Jeff Moden (9/5/2014)


    That took me by surprise the first time I worked with Oracle, as well. NULL same as '' not same as ' '.

    The other thing that took me by total surprise is that comments aren't totally ignored. If you put an "&" in the comments of a proc and execute it, it comes back and either asks you for a parameter or tells you one is missing.

    Of course, SQL Server is not without sin. It says right in BOL that the ORDER BY in a SELECT INTO will not guarantee the order of the IDENTITY(INT,1,1) function that you might also use but, guess what? There's no other place that function will actually work! Yes, you can use ROW_NUMBER() but WTH????

    Heh... MS should stop inventing new things, stop deprecating old things, and start making things work / adding practical things. For example, the CONNECT item to create a machine language level Tally Function or some other similar built-in has been open for 7 years (opened on 2/18/2007) and has had no action. At least they haven't closed it as "built to design" or "won't fix". It would also be nice if they provided a built in text splitter that would work the same for 1 or multiple character delimiters and NVARCHAR/VARCHAR of any size those two datatypes can handle without having to specify. It can't be that bloody difficult... tons of people have written short CLRs to do the same thing. Grab one of those and call it a day!!!

    My guess is that DelimitedSplit8K caught their attention and they haven't been able to figure out a way to write something faster.

    Thanks for the really nice compliment, Dwain. But, we both know that it doesn't handle the MAX data-types well and that a decent CLR will handle both VARCHAR and NVARCHAR instead of needing two separate splitters. And, a CLR will beat it quite handily for performance if written correctly.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Solomon Rutzky (9/9/2014)


    In the end, isn't this why Microsoft gave us CLR Integration?

    Heh... yeah... just like they gave us the ability to write extended stored procedures. 😉 I wonder when they'll deprecate and discontinue SQLCLR.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 15 posts - 661 through 675 (of 981 total)

You must be logged in to reply to this topic. Login to reply