Int versus Char primary key performance

  • My company has a scenario where we would like to change the data type

    of an existing primary key from an integer to a char, but we are

    concerned about the performance implications of doing so.  The script

    for the two tables that we need to modify is listed below.  Table

    FR_Sessions contains a column named TransmissionID which is currently

    an integer.  This table contains about 1 million rows of data.  Table

FR_VTracking also contains TransmissionID as part of its

    primary key, and it contains about 35 million rows of data.  These two

    tables are frequently joined on TransmissionID (FR_Sessions is the

    parent).  The TransmissionID column is used primarily for joins and is

    not typically displayed.

    We need to change the TransmissionID data type from int to

    char(7), and I had a few questions:

    1) Would this introduce significant performance degradation?  I have

    read that char keys/indexes are slower than int/numeric.

    2) Are there collation options (or any other optimizations) that we

    could use to minimize the performance hit of the char(7)...if so which

    ones?

    I am a software architect by trade, not a database guru, so please go

easy on me if I overlooked something obvious.

    Any suggestions or information would be greatly appreciated.

    Thanks,

    Tim

    -------------------

    CREATE TABLE [FR_Sessions] (

            [TransmissionID] [int] IDENTITY (1, 1) NOT NULL ,

            [PTUID] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,

            [PortNum] [numeric](6, 0) NOT NULL CONSTRAINT [DF_FR_Sessions_PortNum]

    DEFAULT (0),

            [CloseStatus] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL

    ,

            [RecvBytes] [int] NULL ,

            [SendBytes] [int] NULL ,

            [EndDT] [datetime] NULL CONSTRAINT [DF_FR_Sessions_EndDT] DEFAULT

    (getutcdate()),

            [LocalEndDT] [datetime] NULL ,

            [TotalTime] [int] NULL ,

            [OffenderID] [numeric](9, 0) NULL ,

            [UploadStatus] [char] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NOT

    NULL CONSTRAINT [DF_FR_Sessions_UploadStatus] DEFAULT ('N'),

            [SchedBatchID] [numeric](18, 0) NULL ,

            [SWVersion] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

            [DLST] [bit] NULL ,

            [TZO] [smallint] NULL ,

            [Processed] [bit] NOT NULL CONSTRAINT [DF_FR_Sessions_Processed]

    DEFAULT (0),

            [CallerID] [varchar] (13) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

            [PeerIP] [varchar] (16) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

            [XtraInfo] [varchar] (1024) COLLATE SQL_Latin1_General_CP1_CI_AS NULL

    ,

            [IdType] [char] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

            CONSTRAINT [PK_FR_Sessions] PRIMARY KEY  CLUSTERED

            (

                    [TransmissionID]

            ) WITH  FILLFACTOR = 90  ON [PRIMARY]

    ) ON [PRIMARY]

    CREATE TABLE [FR_VTracking] (

            [TransmissionID] [int] NOT NULL ,

            [FrameNum] [int] NOT NULL ,

            [LatDegrees] [float] NOT NULL ,

            [LonDegrees] [float] NOT NULL ,

            [Altitude] [float] NOT NULL ,

            [Velocity] [float] NOT NULL ,

            [NumPositions] [smallint] NOT NULL ,

            [NavMode] [smallint] NOT NULL ,

            [Units] [smallint] NOT NULL ,

            [GPSTrackingID] [numeric](18, 0) IDENTITY (1, 1) NOT NULL ,

            [dtStamp] [datetime] NULL ,

            CONSTRAINT [PK_FR_VTracking] PRIMARY KEY  CLUSTERED

            (

                    [TransmissionID],

                    [FrameNum]

            ) WITH  FILLFACTOR = 90  ON [PRIMARY]

    ) ON [PRIMARY]

  • The issue of char vs. int is simple:

    1. storage size

    2. the low level comparison that goes on

#1 is straightforward: int is 4 bytes and char(n) is n bytes.  The effect shows up in the number of index pages required to reference everything, which also depends on what other columns you include in the index... then do the math.

#2 is a little more abstract.  I suppose that if you're using a binary sort order (case sensitive, a=a only, as opposed to a=A, aka insensitive), int and char(4) are probably almost the same speed.  There's a tiny hit in how the two are compared (extra processor work).

Factor #1 is all I ever worry about, because your greatest performance hit is loading/swapping pages around.  Second to that is jumping around on the pages, and finally comparing data.

    Why not do a quick test and see what works best?
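    A minimal sketch of such a test (the table names are made up, and the zero-padded CHAR value is my assumption to keep the two keys comparable):

```sql
-- Quick, hypothetical side-by-side: the same keys as INT vs CHAR(7).
CREATE TABLE #KeyInt  ([ID] [int]      NOT NULL PRIMARY KEY)
CREATE TABLE #KeyChar ([ID] [char] (7) NOT NULL PRIMARY KEY)

-- Populate both tables with the same 100,000 key values.
DECLARE @i int
SET @i = 1
WHILE @i <= 100000
BEGIN
    INSERT INTO #KeyInt  ([ID]) VALUES (@i)
    INSERT INTO #KeyChar ([ID]) VALUES (RIGHT('0000000' + CONVERT(varchar(7), @i), 7))
    SET @i = @i + 1
END

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Same lookup through each key; compare reads and CPU time.
SELECT * FROM #KeyInt  WHERE [ID] = 12345
SELECT * FROM #KeyChar WHERE [ID] = '0012345'

SET STATISTICS IO OFF
SET STATISTICS TIME OFF
```

    Joining each key table against a copy of itself (or against a 35-million-row child, as in your case) would exercise the join path the same way.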

Most often, I let the business requirements drive things.  If the keys aren't going to get too huge, and the key data is something the user is aware of and meaningful to them (or assigned by them by rule), I let it be char.  If it is a system-assigned thing, or a behind-the-scenes thing (surrogate keys maintained by the app), I make it INT.

No doubt there's an in-depth article somewhere on this site...

I personally don't think that you are going to perceive a big hit because of that change.  I would worry if you were using those columns for range, ordered, or arithmetic searches (you may need to zero-pad things and fix some "between" queries, etc.), but if you are not using them that way I wouldn't be concerned.

The size difference in this case, as expressed above, is negligible as long as you keep the indexes.
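    To illustrate the zero-padding point: once the key is char(7), ordering is lexical, so an unpadded range predicate can quietly pick up the wrong rows (the values below are hypothetical):

```sql
-- Lexically, '1000001' falls BETWEEN '100' AND '200', so an unpadded
-- range over a char key returns rows a numeric range never would.
-- Zero-padding stored values to the full width restores numeric order:
SELECT [TransmissionID]
FROM   [FR_Sessions]
WHERE  [TransmissionID] BETWEEN '0000100' AND '0000200'

-- Padding an int variable for comparison against a char(7) key:
-- RIGHT('0000000' + CONVERT(varchar(7), @id), 7)
```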

    Just my $0.02



    * Noel

If the queries you mention run so often that this is an issue, then why not keep the original identity key, add the new varchar() column, and add an index on it (if needed)?  That way the original queries are unchanged, and any queries that require the new column are indexed as well.  The only problems you run into then are the size of the new index (which could be a factor), the extra time on inserts (as the new index has to be maintained too), and the fact that you may have to enforce uniqueness for the new column.

It kind of goes against 'good' database design, as you now have two keys (sort of) for the same table, but sometimes you have to go that route when modifying an existing system and you don't have the time/luxury of making a possibly global change.
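    A sketch of that approach on the parent table (the column name TransmissionCode and the NULL-then-backfill step are my assumptions):

```sql
-- Keep the INT identity PK; add the new char(7) value alongside it.
ALTER TABLE [FR_Sessions]
    ADD [TransmissionCode] [char] (7) COLLATE SQL_Latin1_General_CP1_CI_AS NULL

-- ...backfill TransmissionCode here, then optionally ALTER it to NOT NULL...

-- A unique index both speeds lookups and enforces uniqueness:
CREATE UNIQUE NONCLUSTERED INDEX [IX_FR_Sessions_TransmissionCode]
    ON [FR_Sessions] ([TransmissionCode])
```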

    Cheers

    TNS

    SQL noob

  • Okay, completely ignoring the logical aspects here.

John has already hit the nail right on the head here.  Even if you use a CHAR(4), I suspect sorting and/or joining operations will be faster on an INT column.  That holds even when using a binary collation.

So, from a pure performance point of view I would stick with the INT PK.  However, not everything in an RDBMS is about performance, and you might identify valid reasons to modify your structure.  But this depends on your requirements and your specific environment.  It's hard to judge from afar.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

Another persistent myth of the database world.  Make no mistake about it: it IS a myth that character data types necessarily offer inferior performance to numeric data.  Your search arguments are converted to binary values and then compared with the binary values stored in the database, so assuming the same data length (int or char(4)) there will be little or no difference in most cases.

Long ago I put this to the test.  It's really quite easy to do.  Create a table with three columns: an integer, a char(4), and a varchar(4).

Now populate the table by declaring a loop and incrementing an integer value.  Populate the integer column with the counter value and populate the other columns by converting that value to a character string.  (I think I put about 2 million rows in mine.)  This makes sure that all the rows have the same selectivity (unique) and that, without indexes, the queries force a full table scan.  Next, turn on STATISTICS IO and STATISTICS TIME; you can also use SHOWPLAN_ALL.  Run some simple queries against each column looking for a single value.  You will notice that the queries against the character data take additional time to compile.  I think this is due to SQL needing to sort out character sets and sort order, but once the query plan is compiled the actual execution time is identical (or nearly so) to the integer data type.

Now, to make it a more true-to-life test, add a nonclustered index to each column and run the queries again.  You will see that there is no meaningful, or consistent, difference in either the compile time or the execution time between the character and integer data types.
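    That test might be sketched like this (names are mine; the row count is cut down, and the values are capped so they fit in four characters):

```sql
CREATE TABLE [KeyTest] (
    [IntKey] [int]          NOT NULL ,
    [ChKey]  [char] (4)     NOT NULL ,
    [VcKey]  [varchar] (4)  NOT NULL
)

DECLARE @i int
SET @i = 1
WHILE @i <= 9999            -- keep values within char(4)
BEGIN
    INSERT INTO [KeyTest] ([IntKey], [ChKey], [VcKey])
    VALUES (@i, CONVERT(char(4), @i), CONVERT(varchar(4), @i))
    SET @i = @i + 1
END

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Without indexes, each of these forces a full scan:
SELECT * FROM [KeyTest] WHERE [IntKey] = 5432
SELECT * FROM [KeyTest] WHERE [ChKey]  = '5432'
SELECT * FROM [KeyTest] WHERE [VcKey]  = '5432'

-- The more true-to-life version, with a nonclustered index per column:
CREATE NONCLUSTERED INDEX [IX_KeyTest_Int] ON [KeyTest] ([IntKey])
CREATE NONCLUSTERED INDEX [IX_KeyTest_Ch]  ON [KeyTest] ([ChKey])
CREATE NONCLUSTERED INDEX [IX_KeyTest_Vc]  ON [KeyTest] ([VcKey])

SELECT * FROM [KeyTest] WHERE [IntKey] = 5432
SELECT * FROM [KeyTest] WHERE [ChKey]  = '5432'
SELECT * FROM [KeyTest] WHERE [VcKey]  = '5432'
```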

    Now the difference between a varchar(255) and an integer column will probably be significant because the number of reads goes up...

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • And my 2 cents:

Don't forget: a char(n) is stored internally as an array of n bytes.

    So, to make a comparison, SQL Server has to loop and compare the bytes one after the other until they differ (or all match).

    Therefore, even if the algorithm is highly optimized, it is up to n times slower than a single integer comparison.

    If you change your data type from int to char(7) it is even worse, because you are nearly doubling the size of the key (and the index size as well: you can store fewer entries per page, so you need more disk access to read your data).
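    Rough page math behind that point, ignoring row headers and any other index columns, so treat the numbers as illustrative only:

```sql
-- Approximate key entries per 8 KB index page (8,060 usable bytes):
SELECT 8060 / 4 AS [IntKeysPerPage],    -- 2015
       8060 / 7 AS [Char7KeysPerPage]   -- 1151
```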



    Bye
    Gabor
