Finding columns with null values

  • ScottPletcher wrote:

    Jeff Moden wrote:

    ScottPletcher wrote:

    Btw, why are you using such an inefficient method, checking the column values one by one?

    Why are you asking "why" instead of posting a link to something better? 😀

    Then why are you asking me that instead of posting something better yourself? 😀

    I'm trying to keep you from becoming like a famous book author that we both know. 😉

    Seriously... making such a claim without substance isn't like you, Scott.  You didn't even offer a reason as to why it's inefficient, never mind the server impact on future statistics rebuilds. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • ScottPletcher wrote:

    I asked because many people are happy with code as long as it ultimately does what they need it to do, i.e. "the code's good enough for me."  In that case, it's a waste of my time to code an improved method.  I don't have an existing link to reference (all code I've seen on the internet is just as inefficient, it really seems they all copied from the same original method), so I'd have to code it myself (or modify some code I've done for work here).  I've got tons of stuff to do right now, so if OP is good with the method they have, I'm not gonna post code they are very likely to ignore anyway.

    I totally agree there.  I just spent too long looking for a decent article or even a forum answer on the subject.  Many of the answers are just like you say... they're identical to the code the OP is using.

    Shifting gears to the code itself, it's pretty much both a resource and a performance nightmare.  It makes one full pass of either the CI or an NCI for each column investigated.  What's really bad is that it has a WHERE clause for every bloody column, which means that it will add statistics for virtually every column of the table.  That, in turn, will kill performance and add still more resource usage to future statistics rebuilds.
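
    An easy way to see that side effect for yourself is to look for the auto-created statistics (the _WA_Sys_... ones) that those per-column WHERE clauses leave behind.  Just as a hedged, generic example, assuming the table in question is dbo.Loans:

     SELECT  StatsName  = s.name
            ,ColumnName = c.name
       FROM sys.stats s
       JOIN sys.stats_columns sc ON sc.object_id = s.object_id AND sc.stats_id  = s.stats_id
       JOIN sys.columns c        ON c.object_id  = sc.object_id AND c.column_id = sc.column_id
      WHERE s.object_id    = OBJECT_ID(N'dbo.Loans')
        AND s.auto_created = 1 --Statistics SQL Server created on its own (typically named _WA_Sys_...).
     ;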

    I banged out some code to solve most of that.  I have to add a flower box and add the option to equate empty strings and blanks to NULLs, but it does do everything it needs to do in a single pass of the HEAP or CI and it does so without building statistics for every column that doesn't already have them.  That latter part is really important if you have some really bad non-normalized table designs in the database.  We have some insane tables that are very heavily used that are approaching 150 columns wide.  I can see something like that for an import staging table, but we have a whole lot more permanent tables in that condition than I'd like.  In fact, I tested the code below on one that's 147 columns wide.  Only 20 of the columns have the NOT NULL constraint set and a lot of them are fairly sparsely populated (I've got Rev 02 for the code below in mind for that).

    Anyway, here's the working proof of principle.  It's pretty easy code when you think about it.

    --===== This could be a parameter in a stored procedure
    DECLARE @pQualTableName NVARCHAR(261) = N'dbo.Loans'
    ;
    --===== Define the local variables.
         -- They all need to be NVARCHAR(MAX) to prevent truncation during the REPLACEs coming up.
    DECLARE  @ColCnt NVARCHAR(MAX) --Will contain the SELECT list of columns and COUNTs.
            ,@Unpvt  NVARCHAR(MAX) --Will contain the column list for the CROSS APPLY for the unpivot.
            ,@SQL    NVARCHAR(MAX) --The main Dynamic SQL Template that we'll populate with the above.
    ;
    --===== This gets all of the dynamic parts of the SQL that we need.
         -- Note that COUNT(SomeColName) only counts non-NULL values and will only return a "0" if ALL the values
         -- in the column are NULL. Also note that "column order" doesn't matter here at all.
     SELECT  @ColCnt = ISNULL(@ColCnt+NCHAR(10)+N',',N' ') --Old "Black Arts" method to kill first comma and add C/R.
                     + CONCAT(QUOTENAME(name),N' = COUNT(',QUOTENAME(name),N')')
            ,@Unpvt  = ISNULL(@Unpvt +NCHAR(10)+N',',N' ') --Old "Black Arts" method to kill first comma and add C/R.
                     + CONCAT('(''',QUOTENAME(name),N''',',QUOTENAME(name),N')')
       FROM sys.columns
      WHERE object_id   = OBJECT_ID(@pQualTableName)
        AND is_nullable = 1 --No need to check non-nullable columns.
    ;
    --===== This is the Dynamic SQL Template that we'll populate
     SELECT @SQL = N'
    WITH cte AS
    (
     SELECT
    <<@ColCnt>>
       FROM <<@pQualTableName>>
    )
     SELECT unpvt.ColName
       FROM cte
      CROSS APPLY (VALUES
    <<@Unpvt>>
                  )unpvt(ColName,ColCnt)
      WHERE unpvt.ColCnt = 0;'
    ;
    --===== We have all the pieces... Build the Dynamic Query.
     SELECT @SQL = REPLACE(REPLACE(REPLACE(
                    @SQL
            ,N'<<@pQualTableName>>',@pQualTableName)
            ,N'<<@ColCnt>>'        ,@ColCnt)
            ,N'<<@Unpvt>>'         ,@Unpvt)
    ;
    --===== Produce the list of only those columns that contain only NULLs
       EXEC (@SQL)
    ;
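
    Just to make the template concrete, here's roughly what the generated dynamic SQL ends up looking like for a hypothetical table with just two nullable columns, Foo and Bar (names made up purely for illustration):

     WITH cte AS
     (
      SELECT
       [Foo] = COUNT([Foo])
      ,[Bar] = COUNT([Bar])
        FROM dbo.Loans
     )
      SELECT unpvt.ColName
        FROM cte
       CROSS APPLY (VALUES
        ('[Foo]',[Foo])
       ,('[Bar]',[Bar])
                   )unpvt(ColName,ColCnt)
       WHERE unpvt.ColCnt = 0;

    Each COUNT([Col]) only counts non-NULL values and they're all computed in the same single pass of the table; the CROSS APPLY (VALUES ...) then unpivots the one-row result so the WHERE ColCnt = 0 filter can list just the all-NULL columns.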

    --Jeff Moden



    As a bit of a sidebar, the table I tested it against has, like I said, 147 columns.  It's kind of a small table, weighing in at only 8.4GB and 10.7 million rows.  Considering the 147 columns it works on, I guess the run time of 1 minute and 4 seconds is reasonable.  Not sure how to make that faster, but it is only doing a single scan of the CI.

    And I keep thinking of some simple but useful things to add to the code, like calculating how much space would be saved if the empty columns were dropped, etc.  It would be fairly trivial to also have it generate the code to do the column drops followed by a rebuild of the CI.  Of course, I wouldn't execute the drop code... such a thing should always be reviewed before being executed.  Any necessary tweaks would be in the form of just dropping or commenting out some code.
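
    For what it's worth, that generated "review before you run it" output would just be a script along these lines (the column and index names below are hypothetical, not produced by the code above):

     --===== Hypothetical generated script. Review (and edit) before executing anything!
     ALTER TABLE dbo.Loans DROP COLUMN [EmptyCol1];
     ALTER TABLE dbo.Loans DROP COLUMN [EmptyCol2];
     --===== Rebuild the clustered index afterwards to actually reclaim the space from the dropped columns.
     ALTER INDEX [CI_Loans] ON dbo.Loans REBUILD; --The CI name here is made up.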


    --Jeff Moden



  • Jeff Moden wrote:

    Shifting gears to the code itself, it's pretty much both a resource and a performance nightmare.  It makes one full pass of either the CI or an NCI for each column investigated.

    I don't think so; I think it would have to read only to the first non-NULL for each column.  Of course, for a given column, if all values were NULL, then it would require a full scan of the table for that column.  And it might require half-scans or anything in between, just depending on where the first non-NULL value occurred.  Ugh, yeah, definitely still not an efficient query at all.

    And I really am curious as to why so many people copy such code and use it without realizing the "ugh" part and re-doing the code.  I have in-house versions of the code, although I include a COUNT(*) and retain the count of non-NULL values for each column (no columns are left out).  If you have to scan the table anyway, might as well.  Optionally, you can request that a datetime be added to the result, so you know when the counts are from.

    The problem with my code is I have many custom things added for specifics related to our tables and our naming/rules and abbreviations, which I'd have to strip out/adjust before posting the code.  And my procs are in the master db -- so they can be used on any table in any db -- and some people don't care for that.  It also means the proc name starts, perforce, with sp_, so to use it in a single user db I'd need to change the proc name.
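
    To illustrate the shape of the result Scott is describing (a COUNT(*), the non-NULL count for every column, and an optional timestamp), here's a minimal, generic sketch against a made-up table; it is not his actual proc:

     SELECT  TotalRows    = COUNT(*)
            ,Col1_NotNull = COUNT(Col1)
            ,Col2_NotNull = COUNT(Col2)
            ,CountedAt    = SYSDATETIME() --Optional, so you know when the counts are from.
       FROM dbo.SomeTable; --Hypothetical table and column names.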

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

    I'm with ya on having some good stuff in the Master database.  I agree that a lot of people poo-poo such things, but it works incredibly well, especially if you're careful with naming.  I also keep a script handy to restore all the things I have in the Master database, just in case.

    As to why people just copy'n'paste stuff with no understanding... the action itself explains it... they have no understanding to begin with and just want to "get something off their plate".  When you ask them why they don't understand, you get the song'n'dance that the company doesn't provide training and so they take no initiative on their own after hours because they have a "work/life" balance that says work is 9-5, period, or some other reason not to invest any personal time.  Of course, I think you knew I was going to say that because I think you may be of the same ilk as me on that subject. 😀

    --Jeff Moden



  • Back to Jeff's point, the code as written will only return a hit if ALL rows are NULL for a particular column.   If even one row has a non-null value, then it will fail the NOT EXISTS test.

    IF NOT EXISTS (SELECT top 1 * FROM NAP.OPERATION WHERE [' + @col + '] IS NOT NULL)

    If the intent is to report any columns that have NULL values for any rows, the code should be

    IF EXISTS (SELECT top 1 * FROM NAP.OPERATION WHERE [' + @col + '] IS NULL)

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I think the OP does want to find only columns that are all NULL.  Perhaps to remove the column?

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Perhaps.    I was just going by his language that says the columns contain nulls.     I don't think he ever used the words "all" or "only".

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Yep... I was looking at his code, as well.  It appears that the OP wants only columns that contain nothing but nulls.

    --Jeff Moden



  • The Dixie Flatline wrote:

    Perhaps.    I was just going by his language that says the columns contain nulls.     I don't think he ever used the words "all" or "only".

    I was going by the fact that the OP has been using this code for some time and it does what they want.  Presumably by now they know it's checking for all NULLs and not any NULL.

    "Which columns have null values" could be taken as "have all null values", esp. if you're thinking that way when you write it.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!
