How to determine what percent of a varchar column is a valid numeric?

  • I have a data profiling type of question :
    I'm interesting in a query to determine what percent of a varchar column is a valid numeric?
    sql*server 2012.
    i.e. the varchar column may hold a value like 123.4 or a single blank or a value like 'NA' ..
    From the data profiling perspective, this might be something that is a common type of requirement.
    If anyone is willing to share .. if they use anything similar ?

  • Personally, I don't put numbers in varchar if I want them to be numbers. I put numbers in numbers columns. That way, no confusion. However, what you're looking for is this function: ISNUMERIC. You should be able to pretty easily write a query that gets a count of the table and a count of the table where the VARCHAR columns return true from that function. Compare the two to arrive at a percentage.

    Piece of advice, dates go in date data types, numbers go in numbers, strings in strings. Don't mix them up for any reason or you pay a very high cost later.

    ----------------------------------------------------The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore RooseveltThe Scary DBAAuthor of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd EditionProduct Evangelist for Red Gate Software

  • Grant Fritchey - Wednesday, February 27, 2019 1:40 AM

    Personally, I don't put numbers in varchar if I want them to be numbers. I put numbers in numbers columns. That way, no confusion. However, what you're looking for is this function: ISNUMERIC. You should be able to pretty easily write a query that gets a count of the table and a count of the table where the VARCHAR columns return true from that function. Compare the two to arrive at a percentage.

    Piece of advice, dates go in date data types, numbers go in numbers, strings in strings. Don't mix them up for any reason or you pay a very high cost later.

    Careful now... things like 8E2, a single comma, multiple commas (for example, ',,,,' will return a 1 for ISNUMERIC()), and a raft of other stuff is considered to be valid numerics because of potential conversions to engineering data types, the Money data type, and more.

    I DO , however, agree that things like "NA" or "N/A" are formatting junk for reports and someone has made the serious mistake of combining the presentation layer with the data layer if they've done such a thing.  That really needs to be fixed because it's causing a world of hurt right now and will only get worse as time wears on and people and code becomes more and more dependent on this mistake being present.  It's also killing performance and wasting disk space and memory.  If it's a big table, then it's wasting a lot.  The indexes are probably taking a beating for page splits due to "ExpAnsive" updates, which causes  unnecessary fragmentation, reduction in page density (waste of memory and disk space again), and a shedload of totally unnecessary log file activity which further impacts performance. 

    People will also say they already have too many dependencies on the data to fix it now.  Ladies and Gentlemen, THAT's the best reason to fix it because THAT means that you have those same problems all over the place!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • allan.ford17 - Tuesday, February 26, 2019 7:38 PM

    I have a data profiling type of question :
    I'm interesting in a query to determine what percent of a varchar column is a valid numeric?
    sql*server 2012.
    i.e. the varchar column may hold a value like 123.4 or a single blank or a value like 'NA' ..
    From the data profiling perspective, this might be something that is a common type of requirement.
    If anyone is willing to share .. if they use anything similar ?

    What, in particular, do you consider to be "numeric"???  You have the obvious thing of "123.4" but is "1" valid (something with no decimal point)?  Is a single period valid?  Is a number like ".4" valid or should it be "0.4"?

    As for a "data profiling perspective", see my note in response to what Grant wrote above.  Your "data profile" is broken and it really needs to be fixed instead of wasting your time with trying to separate the wheat from the chaff.  The ONLY exception to this "rule" is if it's data that you're receiving from a third party and you're trying to "profile the data" to fix it in the manner I identified in that previous response.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • That's one of the reasons SQL allows NULL "values" in columns (NULL is technically not a value, it's the absence of a value; for day-to-day purposes, you can think of it as "unknown").

    So, define the column as numeric and store non-numeric values as NULL.  You can add a separate flag column for that column if you want to be able to distinguish blank from 'NA', etc..

    For example:

    value decimal(9, 3) NULL
    value_flag tinyint NOT NULL /*0=value is numeric; 1=value was blank; 2=value was 'NA'; etc.*/

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Here's an example on how to get a percentage of data that is not numeric. It can also help with all the conversion to numeric columns that has been suggested by others and I totally support.

    --I'm creating a table with sample data
    SELECT TOP 1000
      CAST(RAND(CHECKSUM(NEWID()))*1000 AS varchar(25)) SomeColumn
    INTO #SampleTable
    FROM sys.all_columns x, sys.all_columns y;

    --Converting some numeric values into non-numeric
    UPDATE #SampleTable
    SET SomeColumn = 'NA'
    WHERE SomeColumn LIKE '1%';

    --TRY_CAST will convert to NULL all values that are not valid for the conversion. NULLs will be ignored by the COUNT function. The rest is a basic percentage formula
    SELECT COUNT(TRY_CAST( SomeColumn AS float)) * 100. / COUNT(*)
    FROM #SampleTable;
    GO
    DROP TABLE #SampleTable

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Sometimes you just have to deal with entity-attribute-value tables that were created over a decade ago with dozens of filtered indexes...  🙁

    I've used something similar to the following in the past as a general wag:
    SELECT    CAST(COUNT(CASE WHEN st.SomeColumn NOT LIKE '%[0-9]%' THEN 1. END) / CAST(COUNT(*) AS float) * 100.00 AS decimal(10,3)) AS NumericPercentage
    FROM    #SampleTable AS st
    ;

  • Lots of excellent advice and answers .. thank you all  .. 

    Appreciate warnings from Jeff about the ISNUMERIC function.

    For what I wanted .. rough data profiling .. the ISNUMERIC function should be good enough ..
    I used :

    SELECT  CAST(COUNT(CASE WHEN ISNUMERIC(st.[Some Column Name])=1 THEN 1. END) / CAST(COUNT(*) AS float) * 100.00 AS decimal(10,3)) AS NumericPercentage
    FROM  SomeTable AS st

    Yes- I have opportunity to redesign table (not my design) and will carefully choose data types !

    The source of the data is a CSV and the CSV landing table has all columns as varchar(50) ..

  • allan.ford17 - Wednesday, February 27, 2019 3:11 PM

    Lots of excellent advice and answers .. thank you all  .. 

    Appreciate warnings from Jeff about the ISNUMERIC function.

    For what I wanted .. rough data profiling .. the ISNUMERIC function should be good enough ..
    I used :

    SELECT  CAST(COUNT(CASE WHEN ISNUMERIC(st.[Some Column Name])=1 THEN 1. END) / CAST(COUNT(*) AS float) * 100.00 AS decimal(10,3)) AS NumericPercentage
    FROM  SomeTable AS st

    Yes- I have opportunity to redesign table (not my design) and will carefully choose data types !

    The source of the data is a CSV and the CSV landing table has all columns as varchar(50) ..

    Heh.. THAT is the exception that I was talking about.  Landing/Staging table for input preprocessing.  You have the idea to make the changes to the final destination table.  Thanks for the feedback.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply