Changing nvarchar to varchar

  • I have a table with millions of rows in it. The character fields are nvarchar.

    Currently it only has a clustered index on an auto-incrementing bigint field. There is a requirement to start reporting on the table's contents.

    If I add new non-clustered indexes for reporting, its size will increase dramatically and inserts will slow down.

    I'm suggesting that, during a maintenance window, we drop the table and re-create it with varchar fields instead of nvarchar (we don't need Unicode characters), seeding the new table with the current auto ID value.

    1. The table only ever contains a month's worth of data; it's continuously being purged, so I could discard the current batch. I'm concerned that running ALTER TABLE ... ALTER COLUMN would take ages and fill the transaction log.

    2. Would smaller indexes, based on varchars, be beneficial to SELECT query performance and insert performance?

    Am I right in making these assumptions? Is it worth trying?

    You have to consider something else that most people don't, and that is "implicit conversions".  You have to know how external code is referencing the table.  If it's passing string literals as Unicode for criteria, the VARCHAR() column it references must be converted on the fly, which forces a full scan of either a non-clustered index or (in the very common worst case) a clustered index scan, which is basically a full table scan.  Such Unicode string literals used as criteria against VARCHAR() columns will NEVER do an index seek... only NCI index scans, at best.
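
    A minimal repro of the behaviour described above (the table, index, and data here are made up purely for illustration):

    CREATE TABLE dbo.Demo (Name VARCHAR(50) NOT NULL);
    CREATE INDEX IX_Demo_Name ON dbo.Demo (Name);
    INSERT INTO dbo.Demo VALUES ('Bob'), ('Jane'), ('Steve');

    -- Unicode literal vs VARCHAR column: the COLUMN is implicitly converted
    -- (NVARCHAR has higher datatype precedence), so the plan shows an index SCAN.
    SELECT Name FROM dbo.Demo WHERE Name = N'Bob';

    -- VARCHAR literal vs VARCHAR column: a straight index SEEK.
    SELECT Name FROM dbo.Demo WHERE Name = 'Bob';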

    Until you know the nitty-gritty details of what I've spoken about above, I recommend that you NOT do the conversion you're talking about, because your queries could quite literally become thousands of times slower.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

    Thanks Jeff, that's a good point to which I don't know the answer. I don't have control over the code, and I doubt the support guy I'm talking to would know either.

    From what I can see it's using JavaScript (I think). I do know it's using an ODBC data source saved on the machine; the data source is created using the SQL Server Native Client driver.

    The table is being populated with big INSERT statements that I can see in Activity Monitor, e.g. INSERT INTO table VALUES (1, 'Bob'), (2, 'Jane'), (3, 'Steve')

    The basic reporting queries we've discussed are

    SELECT ... WHERE DateAdded > '20200903' AND DateAdded < '20200903' AND type = 'error'

    I would have thought SQL would say "'error' is Unicode, I'll convert it to varchar ONCE because the Type field is varchar", rather than converting all of the index?

    Would a parameterised sp_executesql query force it to use varchars? Or perhaps a search SP?
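
    Something like this is what I have in mind (a sketch only - the table and column names are made up), since the parameter types are declared explicitly:

    EXEC sys.sp_executesql
        N'SELECT * FROM dbo.EventLog
          WHERE DateAdded >= @From AND DateAdded < @To AND [type] = @Type;',
        N'@From datetime, @To datetime, @Type varchar(50)',
        @From = '20200903', @To = '20200904', @Type = 'error';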


    In contrast to @JeffModen's response - it is also possible that the client(s) are sending varchar parameters and all of those queries are using implicit conversions and full index scans.

    Before making the change, you need to determine which parameter types are being sent - nvarchar or varchar - and how much work it would be to change that code if needed.

    You should also review and validate all stored procedures and functions.  Are the parameters defined as nvarchar or varchar - and if nvarchar, do you have hard-coded varchar strings?

    Example: SELECT ... FROM ... WHERE col1 = 'one'  -- uses a varchar string and should be N'one'

    As for indexing, the data type won't matter unless the parameters/hard-coded strings/etc. are varchar.  Indexes would not be used for nvarchar columns in those situations.
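
    One rough way to check for this (a sketch - the table name is just a placeholder) is to search the plan cache for implicit conversion warnings:

    SELECT TOP (20) cp.usecounts, st.text, qp.query_plan
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    WHERE st.text LIKE '%EventLog%'                              -- placeholder table name
      AND CAST(qp.query_plan AS nvarchar(max)) LIKE '%CONVERT_IMPLICIT%';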

     

    Jeffrey Williams
    Problems are opportunities brilliantly disguised as insurmountable obstacles.

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Jeffrey Williams wrote:

    In contrast to @JeffModen's response - it is also possible that the client(s) are sending varchar parameters and all of those queries are using implicit conversions and full index scans.

    Probably not.  Yes... there will be an implicit cast in such a case, but VARCHAR() is lower on the datatype precedence list, so only the VARCHAR() literal being used as the passed criteria would suffer the implicit cast, and an index SEEK can still happen on the NVARCHAR() column.
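
    A minimal sketch of that direction (names made up for illustration):

    CREATE TABLE dbo.DemoN (Name NVARCHAR(50) NOT NULL);
    CREATE INDEX IX_DemoN_Name ON dbo.DemoN (Name);

    -- The VARCHAR literal is promoted to NVARCHAR; the column is untouched,
    -- so an index SEEK on the NVARCHAR column is still possible.
    SELECT Name FROM dbo.DemoN WHERE Name = 'Bob';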

    --Jeff Moden



  • Jeff Moden wrote:

    Jeffrey Williams wrote:

    In contrast to @JeffModen's response - it is also possible that the client(s) are sending varchar parameters and all of those queries are using implicit conversions and full index scans.

    No.  It doesn't work that way in such a case.  Yes... there will be an implicit cast in such a case, but VARCHAR() is lower on the datatype precedence list, so only the VARCHAR() literal being used as the passed criteria would suffer the implicit cast, and an index SEEK can still happen on the NVARCHAR() column.

    --Jeff Moden



    Thanks both

    I'm pretty certain that the front-end builds the SQL up as a string (I know - SQL injection!) and executes it against the DB using a DB library (not sure which). It will not be using something like ADO with a parameters collection. As it's using ODBC, I guess it will be quite basic.

    If I were to see this in a trace, does this mean the literals are varchar?

    SELECT ... FROM ... WHERE col1 = 'one'

    From experience, queries I execute in Management Studio will use the index, and the plan will show something like SCALAR(N'one') if col1 is nvarchar.

    There are no SPs or functions. I would prefer a SELECT SP so that I could control it, but I'm not sure they would be happy changing code.

    So if I were going to change the table's fields from nvarchar to varchar, as long as they are not executing

    example: SELECT ... FROM ... WHERE col1 = N'one'

    it would use the index?
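
    If it helps, one way I could check what the client actually sends (a sketch - the session and database names are made up) is a quick Extended Events session, then watching the live data in Management Studio:

    CREATE EVENT SESSION CaptureBatches ON SERVER
    ADD EVENT sqlserver.sql_batch_completed
        (WHERE sqlserver.database_name = N'MyDb')   -- placeholder DB name
    ADD TARGET package0.ring_buffer;

    ALTER EVENT SESSION CaptureBatches ON SERVER STATE = START;
    -- the captured batch text shows whether literals arrive as 'one' or N'one'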

  • terry999 wrote:

    The basic reporting queries we've discussed are

    SELECT ... WHERE DateAdded > '20200903' AND DateAdded < '20200903' AND type = 'error'

    With this kind of query you need a clustered index on DateAdded. An auto-incrementing bigint ID is never a good candidate for a clustered index.

    That's what you should change in the table.

    And that change will solve all your performance issues, so you won't need to worry about the nvarchar column and an index on it ever again.

    BTW, the column in question - is it [type]?

    If so - you'd better put the types in a separate table with a tinyint ID (you don't have more than 256 different types, right?) and replace the [type] column with a [TypeID] tinyint referencing that ID via a FK.
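
    In code, that change would look something like this (a sketch - the table, constraint, and column names are guesses):

    -- Move the clustered index to DateAdded (non-unique is fine; SQL Server
    -- adds a hidden uniquifier) and keep the PK as a non-clustered index.
    ALTER TABLE dbo.EventLog DROP CONSTRAINT PK_EventLog;       -- assumed PK name
    CREATE CLUSTERED INDEX CX_EventLog_DateAdded ON dbo.EventLog (DateAdded);
    ALTER TABLE dbo.EventLog ADD CONSTRAINT PK_EventLog PRIMARY KEY NONCLUSTERED (ID);

    -- Optional: the lookup table replacing the [type] strings.
    CREATE TABLE dbo.EventType
    (
        TypeID tinyint NOT NULL PRIMARY KEY,    -- tinyint covers 256 distinct types
        [Type] varchar(50) NOT NULL UNIQUE
    );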

    _____________
    Code for TallyGenerator

    You're right about Type, but I can't change it; that would require a code change.

    Why is an auto-incrementing bigint ID a bad idea for a clustered index?

    If I were to make DateAdded a clustered index, I thought it would use a unique bigint anyway because DateAdded is not unique.

    Also, and not clear from my post, DateAdded isn't the date a row was added to this DB but the date it was added on another computer. In other words, rows will not be added to this table in exact DateAdded order.

    Wouldn't the table become fragmented, and wouldn't the index become a bottleneck?

    Why is an auto-incrementing bigint ID a bad idea for a clustered index?

    Telephone numbers are allocated to devices/SIM cards incrementally, in sequence. It's one example of the auto-incrementing IDs we can find all around us.

    Now, tell me: would it be a good idea to sort the records in the Yellow Pages, or any other kind of phone book, in order of phone numbers? Why do you think it would be a stupid idea? There is the answer to your question.

    The unique PK remains on the table; it just becomes non-clustered.

    Any other index will have a reference to that unique key anyway, even if you don't mention it in the index definition.

    When a WHERE clause specifies a range between two close dates, it's like looking for Sarah Connor's telephone number in a phone book: you open the page for last names starting with "Conn" (because the book is clustered by last name) and then scan that one page for the right record, not wasting any time on the rest of the book. If the book were clustered by number (like your table), you'd have to go to a last-name index at the back of the book and then open every page with another "Connor" on it, trying to find records for Sarah (that's called a "bookmark lookup").

    So, clustering the table by DateAdded allows the query to get a tiny slice of the table straight away just from the date-range selection, and then check the rest of the conditions within that small subset of data.

    As for fragmentation - you should not worry about it too much. It really depends on the way you load the data to the server from that other machine - I could tell you more if you shared it here.

    But even if the data is uploaded in less-than-perfect sequential order by date, it would still be generally correct: yesterday's data ends up on the pages added to the table today, sitting within the same cluster of pages, and today's records will end up in the cluster added tomorrow. So when you query for records between today and tomorrow, you're reading pages sitting next to each other. Whatever order imperfection may exist within those pages doesn't matter much, as you're going to scan all of them for the rest of the WHERE conditions anyway. If fragmentation adds 10%, 20%, or even 100% overhead to that scan, it's overhead on maybe 0.1% of the load you have today. You'll never notice it.

    And - make sure you drop that index defragmentation job immediately after you've changed the clustered index. At least exclude this table from it.
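
    If you want to see how much fragmentation you actually have, before and after the change, something like this works (the table name is a placeholder):

    SELECT i.name, ps.avg_fragmentation_in_percent, ps.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.EventLog'),
                                        NULL, NULL, 'LIMITED') AS ps
    JOIN sys.indexes AS i
      ON i.object_id = ps.object_id AND i.index_id = ps.index_id;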

    _____________
    Code for TallyGenerator

  • Hi Sergiy

    Thanks for your reply. I do appreciate it as I learn.

    Happy because:

    On ANOTHER project, where I was in full control (i.e. loading data and reporting), I adopted your suggested approach: a clustered index on the non-unique DateSold field and a unique non-clustered index on the auto ID field. I always had a doubt whether I'd got this right. I wrote the reports, so every report filtered on DateSold, and I controlled the data import, which happened out of hours, so I could re-index after an import.

    Subject of the original post:

    The data are logs from many different machines (1000s). The table was actually installed as a heap, and it is 98% of the DB size. It was growing very large and I needed a way of purging it without getting the vendor to change their insert code, so I added the auto ID field so I could uniquely identify a row to purge. The purge doesn't get in the way of the constant inserts, as it's removing the data with the smallest IDs - I see it as removing from the bottom while adding at the top.

    Now reporting is going to be used.

    Data is imported all the time. Most of the data is within a few days of each other; however, we do get stuff daily that is months out of date, e.g. machines not decommissioned properly, the service not running, and lots of other issues. I didn't want to purge on DateAdded because those machines' logs would be added and then deleted the same day, when the purge runs.

    Irrelevant really, as no one was actually checking this data. I want to propose truncating the table, making any necessary changes so we have a workable reporting solution, and restarting the import; downtime will be minutes.
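
    Roughly like this (the table name and seed value are placeholders):

    TRUNCATE TABLE dbo.EventLog;
    -- TRUNCATE resets the identity, so reseed it to carry on from the current ID
    DBCC CHECKIDENT ('dbo.EventLog', RESEED, 123456789);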

    Getting the right solution:

    If all reports have a date range, then excellent: clustered index on DateAdded, non-clustered on the auto ID. I don't think that would cause contention/locking when the purge runs.
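
    The purge itself runs along these lines (a sketch - the names, batch size, and cut-off are illustrative):

    DECLARE @PurgeBelow bigint = 123456789;     -- cut-off ID for this run
    WHILE 1 = 1
    BEGIN
        -- small batches keep locks short-lived, so the constant inserts
        -- at the "top" of the table aren't blocked
        DELETE TOP (5000) FROM dbo.EventLog WHERE ID < @PurgeBelow;
        IF @@ROWCOUNT = 0 BREAK;
    END;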

    But... if some reports don't filter on date, they wouldn't benefit from the new index design. I guess I would have to get a ranking of which reports are the most important/most used. Much easier said than done.

    I've got Query Store switched on, but it's pointless at the moment because no one is using the reporting. I can see the inserts are taking a lot of server time, and they would probably benefit from a bulk insert into a staging table and then into the live table. But that would be a vendor code change - not possible.


  • Sergiy wrote:

    An auto-incrementing bigint ID is never a good candidate for a clustered index.

    "Never" is a pretty big word here.  There are some exceptions.  For example, you're using the Autoincremental ID as  a necessary surrogate key like for CustomerID, EmployeeID, etc or using the Clustered Index as a storage control mechanism rather than as a performance mechanism.

    --Jeff Moden



    But... if some reports don't filter on date, they wouldn't benefit from the new index design.

    I bet there is no query which benefits from the current index design.

    Well, the inserts don't cause significant fragmentation of the clustered index due to this design - and that's the only advantage.

    Which is pretty useless, because no SELECT can use this advantage - simply because nobody would ever be interested in selecting records located within a range of artificially created, meaningless numeric IDs.

    You can only select a record for a specific ID, or maybe a bunch of specified IDs - but they would not be sequential. Considering this, any level of fragmentation of the clustered index would not matter at all. So achieving a contiguous clustered index gives you nothing, except, probably, that warm fuzzy feeling from knowing the index is contiguous.

    So, by changing the clustered index from ID to DateAdded you won't lose performance on any query. You might have (or will have) some queries which won't benefit from this change. But with the majority of the queries, based on date selection, completing with extremely short durations, you have the whole server free to deal with those odd queries.

    _____________
    Code for TallyGenerator

  • terry999 wrote:

    So if I were going to change the table's fields from nvarchar to varchar, as long as they are not executing

    example: SELECT ... FROM ... WHERE col1 = N'one'

    it would use the index?

    If the col1 column is VARCHAR(), then the answer is "the only thing it could use an index for is an index scan", because the entire VARCHAR() column would first need to be converted to NVARCHAR() behind the scenes: the Unicode string literal (NVARCHAR() will be used) has a higher datatype precedence than the column it's being compared to.

    If we just consider the raw basis of your question, the answer is "NO".

    This is an age-old problem with ORMs, BTW.

    --Jeff Moden


