Referential Integrity vs. Performance

  • I haven't posted in this forum for a while, so please forgive me if my question isn't in the proper format. I'm also new to SSMS and haven't figured out how to create table relationships yet (my previous job used DBArtisan); otherwise I would have included the CREATE TABLE statements. I am in the process of figuring this out, but my boss wants an answer to this question as soon as possible.  Thank you for your patience.

    Scenario
    My development team is wondering about the necessity of static data/support tables: do the referential integrity benefits justify the overhead of the join required to access them?

    Given the following two somewhat fictitious examples:

    Scenario #1
    We have a voting application where users can vote on Issues. In this scenario a user can cast one of three possible values for his/her vote: FOR, AGAINST, or ABSTAIN.

    Tables
    TheVote
    ID - int
    Topic - nVarChar(3000)
    UserID - int
    TheVoteCastByTheUser - int - This field relates to the VotingOptions table detailed below.

    VotingOptions
    ID - int
    VotingOption - varchar

    The VotingOptions table has three rows:
    Row 1 - VotingOption = FOR
    Row 2 - VotingOption = AGAINST
    Row 3 - VotingOption = ABSTAIN
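
    Roughly, hand-written since I can't generate the real DDL yet (the column types and constraint placement are my best guess):

    CREATE TABLE dbo.VotingOptions (
        ID           int         NOT NULL PRIMARY KEY,
        VotingOption varchar(10) NOT NULL
        );

    CREATE TABLE dbo.TheVote (
        ID                   int            NOT NULL PRIMARY KEY,
        Topic                nvarchar(3000) NOT NULL,
        UserID               int            NOT NULL,
        TheVoteCastByTheUser int            NOT NULL
            REFERENCES dbo.VotingOptions (ID)  -- the join/RI in question
        );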

    Question:
    Some of the developers believe that we don't need the VotingOptions table and that the three VotingOptions of FOR, AGAINST, ABSTAIN can be handled in code.  They believe that the field TheVoteCastByTheUser in table TheVote should be a VarChar that the code sets to FOR, AGAINST, or ABSTAIN (or a value of F, A, or AB to represent those values).  They feel the overhead of joining to the VotingOptions table is not worth the referential integrity benefit. Some of us believe that although the contents of the VotingOptions table will never change, proper design requires the use of the VotingOptions table.

    Your thoughts?

    Scenario #2
    The application has customers.  Each of the Customers has a status that indicates if the Customer is Active, InActive, or Suspended.  Like the scenario above, we have a static data/support table that contains the CustomerStatus values that the Customer table refers to. Unlike the previous scenario, from time to time other status values besides Active, InActive, and Suspended could be added to CustomerStatus.
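
    Again, approximate hand-written DDL (types assumed):

    CREATE TABLE dbo.CustomerStatus (
        ID     int         NOT NULL PRIMARY KEY,
        Status varchar(20) NOT NULL  -- 'Active', 'InActive', 'Suspended', possibly more later
        );

    CREATE TABLE dbo.Customer (
        ID       int NOT NULL PRIMARY KEY,
        StatusID int NOT NULL
            REFERENCES dbo.CustomerStatus (ID)
        -- other customer columns omitted
        );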

    We are having the same debate.  Some believe that even though the CustomerStatus values may change from time to time, the overhead of joining to this table isn't worth the performance hit, and that the table should be eliminated and its functionality handled in code instead.

    Your thoughts?

    Thank you in advance for your patience.  I hope this is clear.  The bottom line is that we need to know whether the overhead of joins to these static data/support tables is justified when the tables hold only a few records and a programmer could easily simulate the values in code.

    Thanks again.

  • Hard-coding the values into the application forces a redeployment any time there is a change to the VotingOptions (not likely, but what if a value such as 'Concur' is added?) or to the CustomerStatus values.  Putting this information in a lookup table and using RI to validate input makes these changes trivial: just add a new row of data to the lookup table.  It also prevents values from being removed while they are still in use.
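
    For example, with lookup tables like the ones sketched above (the exact error text will vary):

    -- Adding a new option is a data change, not a code change or redeployment:
    INSERT dbo.VotingOptions (ID, VotingOption) VALUES (4, 'CONCUR');

    -- And the FK blocks removal of a value that is still referenced by existing votes:
    DELETE dbo.VotingOptions WHERE VotingOption = 'FOR';
    -- Msg 547: The DELETE statement conflicted with the REFERENCE constraint ...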

  • There is no reason to use anything more than a BIT column to record the vote value: 1 = FOR, 0 = AGAINST, NULL = ABSTAIN.   Then add a second BIT column to indicate 1 = voted, 0 = did not vote.  Then add the CustomerID.   The Customer table should most definitely have a StatusID value that JOINs to the rather small table with the status values.   That kind of performance hit is probably not worth even thinking about unless you are going to write really bad queries.
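
    A minimal sketch of that idea (table and column names are just placeholders):

    CREATE TABLE dbo.TheVote (
        ID        int            NOT NULL PRIMARY KEY,
        Topic     nvarchar(3000) NOT NULL,
        UserID    int            NOT NULL,
        Voted     bit            NOT NULL,  -- 1 = voted, 0 = did not vote
        VoteValue bit            NULL       -- 1 = FOR, 0 = AGAINST, NULL = ABSTAIN
        );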

    That's my two bits...  :hehe::hehe::hehe:

  • As an option, one could use a check constraint instead to enforce the business rule; however, in the scenario described a lookup table is the better option because, as you indicated, these statuses aren't static. From my completely unbiased DBA point of view, referential integrity is the way to go in this case.
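
    For the static case, the check constraint version would look something like this (assuming the status were stored directly on the Customer table as a character column instead of a StatusID FK):

    ALTER TABLE dbo.Customer
        ADD CONSTRAINT ck_Customer_Status
        CHECK (CustomerStatus IN ('Active', 'InActive', 'Suspended'));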

  • Thanks for getting back to me. I tend to agree with you. I thought it best to have the support table for referential integrity despite the performance hit for the join. Now I have to sell this to the rest of the team.

  • Thanks. That's a bit of good information (sorry for the pun about using bits).

  • Here's a little test harness you can use for testing...
    USE tempdb;
    GO

    CREATE TABLE dbo.LookUpValues (
        id INT NOT NULL
            CONSTRAINT pk_LKV PRIMARY KEY CLUSTERED,
        value_desc VARCHAR(20) NOT NULL
        );
    GO

    INSERT dbo.LookUpValues (id, value_desc) VALUES
        (1, 'FOR'), (2, 'AGAINST'), (3, 'ABSTAIN');
    GO

    CREATE TABLE dbo.FK_Test (
        id INT NOT NULL
            CONSTRAINT pk_FKTest PRIMARY KEY CLUSTERED,
        fk_value INT NOT NULL
            CONSTRAINT fk_FKTest_fkvalue FOREIGN KEY REFERENCES dbo.LookUpValues(id),
        some_random_column INT NOT NULL
        );
    GO

    CREATE TABLE dbo.CK_Test (
        id INT NOT NULL
            CONSTRAINT pk_CKTest PRIMARY KEY CLUSTERED,
        ck_value VARCHAR(7) NOT NULL
            CONSTRAINT ck_CKTest_ckvalue CHECK (ck_value IN ('FOR','AGAINST','ABSTAIN')),
        some_random_column INT NOT NULL    
        );
    GO

    CREATE TABLE dbo.CommandoTest (
        id INT NOT NULL
            CONSTRAINT pk_CmdoTest PRIMARY KEY CLUSTERED,
        ck_value VARCHAR(7) NOT NULL,
        some_random_column INT NOT NULL
        );
    GO

    /*
    SELECT * FROM dbo.FK_Test ft;
    SELECT * FROM dbo.CK_Test ct;
    SELECT * FROM dbo.CommandoTest ct;
    */
    --==========================================================================================

    IF OBJECT_ID('tempdb..#InsertData', 'U') IS NULL
    BEGIN    -- DROP TABLE #InsertData;
        CREATE TABLE #InsertData (
            id INT NOT NULL PRIMARY KEY CLUSTERED,
            int_val INT NOT NULL,
            char_val AS CASE int_val WHEN 1 THEN 'FOR' WHEN 2 THEN 'AGAINST' WHEN 3 THEN 'ABSTAIN' END PERSISTED,
            some_random_column int NOT NULL
            );

        WITH
            cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
            cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
            cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
            cte_Tally (n) AS (
                SELECT TOP (1000000)
                    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                FROM
                    cte_n3 a CROSS JOIN cte_n3 b
                )
        INSERT #InsertData (id, int_val, some_random_column)
        SELECT
            t.n,
            ABS(CHECKSUM(NEWID())) % 3 + 1,
            ABS(CHECKSUM(NEWID())) % 3333333 + 1
        FROM
            cte_Tally t;
    END;
    GO

    --==========================================================================================

    SET NOCOUNT ON;
    GO
    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
    PRINT(CONCAT(CHAR(13), CHAR(10), N'   wait a moment...', CHAR(13), CHAR(10)));
    WAITFOR DELAY '00:00:01';
    GO
    -- SET STATISTICS XML ON;
    GO
    DECLARE @_clock_start DATETIME2(7) = SYSDATETIME(), @_test_name NVARCHAR(70) = N'FK_Test ';
    PRINT(CONCAT(N'+', REPLICATE(N'-', 148), N'+
    ¦', LEFT(CONCAT(N'    Start Time: ', @_clock_start, N'   Test Name: ', ISNULL(NULLIF(@_test_name, N' '), N'Test Query 1'),
    REPLICATE(N' ', 100)), 148), N'¦', CHAR(13), CHAR(10), N'+', REPLICATE(N'-', 148), N'+')); DECLARE @_data_dump_int INT; SET STATISTICS IO ON;
    -- ____________________________________________________________________________________________________________________________________________
    -- >>>>> place tsql here <<<<<
    INSERT dbo.FK_Test (id, fk_value, some_random_column)
    SELECT
        id.id,
        id.int_val,
        id.some_random_column
    FROM
        #InsertData id;
    -- >>>>> place tsql here <<<<<
    -- ____________________________________________________________________________________________________________________________________________
    DECLARE @_clock_stop DATETIME2(7) = SYSDATETIME(); SET STATISTICS IO OFF; PRINT(CONCAT(N'+', REPLICATE(N'-', 148), N'+
    ¦', LEFT(STUFF(CONCAT(N'    Finish Time: ', @_clock_stop, N'   Duration: ', DATEDIFF(mcs, @_clock_start, @_clock_stop)
    / 1000000.0, N' secs.  ', DATEDIFF(mcs, @_clock_start, @_clock_stop) / 1000.0, N' ms.', REPLICATE(N' ', 100)), 76, 3, N''), 148), N'¦
    +', REPLICATE(N'-', 148), N'+'));
    GO
    -- SET STATISTICS XML OFF;
    GO
    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
    PRINT(CONCAT(CHAR(13), CHAR(10), N'   wait a moment...', CHAR(13), CHAR(10)));
    WAITFOR DELAY '00:00:01';
    GO
    -- SET STATISTICS XML ON;
    GO
    DECLARE @_clock_start DATETIME2(7) = SYSDATETIME(), @_test_name NVARCHAR(70) = N' ';
    PRINT(CONCAT(N'+', REPLICATE(N'-', 148), N'+
    ¦', LEFT(CONCAT(N'    Start Time: ', @_clock_start, N'   Test Name: ', ISNULL(NULLIF(@_test_name, N' '), N'Test Query 2'),
    REPLICATE(N' ', 100)), 148), N'¦', CHAR(13), CHAR(10), N'+', REPLICATE(N'-', 148), N'+')); DECLARE @_data_dump_int INT; SET STATISTICS IO ON;    
    -- ____________________________________________________________________________________________________________________________________________
    -- >>>>> place tsql here <<<<<
    INSERT dbo.CK_Test (id, ck_value, some_random_column)
    SELECT
        id.id,
        id.char_val,
        id.some_random_column
    FROM
        #InsertData id;
    -- >>>>> place tsql here <<<<<
    -- ____________________________________________________________________________________________________________________________________________
    DECLARE @_clock_stop DATETIME2(7) = SYSDATETIME(); SET STATISTICS IO OFF; PRINT(CONCAT(N'+', REPLICATE(N'-', 148), N'+
    ¦', LEFT(STUFF(CONCAT(N'    Finish Time: ', @_clock_stop, N'   Duration: ', DATEDIFF(mcs, @_clock_start, @_clock_stop)
    / 1000000.0, N' secs.  ', DATEDIFF(mcs, @_clock_start, @_clock_stop) / 1000.0, N' ms.', REPLICATE(N' ', 100)), 76, 3, N''), 148), N'¦
    +', REPLICATE(N'-', 148), N'+'));
    GO
    -- SET STATISTICS XML OFF;
    GO
    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
    PRINT(CONCAT(CHAR(13), CHAR(10), N'   wait a moment...', CHAR(13), CHAR(10)));
    WAITFOR DELAY '00:00:01';
    GO
    -- SET STATISTICS XML ON;
    GO
    DECLARE @_clock_start DATETIME2(7) = SYSDATETIME(), @_test_name NVARCHAR(70) = N' ';
    PRINT(CONCAT(N'+', REPLICATE(N'-', 148), N'+
    ¦', LEFT(CONCAT(N'    Start Time: ', @_clock_start, N'   Test Name: ', ISNULL(NULLIF(@_test_name, N' '), N'Test Query 3'),
    REPLICATE(N' ', 100)), 148), N'¦', CHAR(13), CHAR(10), N'+', REPLICATE(N'-', 148), N'+')); DECLARE @_data_dump_int INT; SET STATISTICS IO ON;    
    -- ____________________________________________________________________________________________________________________________________________
    -- >>>>> place tsql here <<<<<
    INSERT dbo.CommandoTest (id, ck_value, some_random_column)
    SELECT
        id.id,
        id.char_val,
        id.some_random_column
    FROM
        #InsertData id;
    -- >>>>> place tsql here <<<<<
    -- ____________________________________________________________________________________________________________________________________________
    DECLARE @_clock_stop DATETIME2(7) = SYSDATETIME(); SET STATISTICS IO OFF; PRINT(CONCAT(N'+', REPLICATE(N'-', 148), N'+
    ¦', LEFT(STUFF(CONCAT(N'    Finish Time: ', @_clock_stop, N'   Duration: ', DATEDIFF(mcs, @_clock_start, @_clock_stop)
    / 1000000.0, N' secs.  ', DATEDIFF(mcs, @_clock_start, @_clock_stop) / 1000.0, N' ms.', REPLICATE(N' ', 100)), 76, 3, N''), 148), N'¦
    +', REPLICATE(N'-', 148), N'+'));
    GO
    -- SET STATISTICS XML OFF;
    GO

    --==========================================================================================
    /*
    TRUNCATE TABLE dbo.FK_Test;
    TRUNCATE TABLE dbo.CK_Test;
    TRUNCATE TABLE dbo.CommandoTest;
    */
    --==========================================================================================
    /*
    DROP TABLE dbo.FK_Test;
    DROP TABLE dbo.CK_Test;
    DROP TABLE dbo.CommandoTest;
    DROP TABLE dbo.LookUpValues;
    DROP TABLE #InsertData;
    */
    --==========================================================================================

    Something to keep in mind before you dismiss the DRI (FK_Test) version as being too slow compared to the others...
    1) The test is doing 1M rows in a single insert. Does that really reflect your actual use case?
    2) If you do need to maintain high performance while doing large row-count inserts, you can always disable or drop the FK constraint for the insert and then re-enable or recreate it after the insert is complete. That would give you the best of both worlds.
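
    For example, using the constraint name from the harness above:

    -- Disable the FK for the bulk load:
    ALTER TABLE dbo.FK_Test NOCHECK CONSTRAINT fk_FKTest_fkvalue;

    -- ... big insert goes here ...

    -- Re-enable it, re-validating the existing rows so the constraint stays trusted:
    ALTER TABLE dbo.FK_Test WITH CHECK CHECK CONSTRAINT fk_FKTest_fkvalue;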

  • Thanks. You obviously know SQL Server better than I do. I will look this over as soon as I can.

  • Thanks. I agree with you and plan to use check constraints as a compromise as I seem to be losing this battle. My way of thinking is to do things the “right” way and then adjust if there’s a problem. My teammates assume there will be a problem and want to fudge up front. One guy insists on using select * from table because it’s less typing. I had to write a code generator to stop him from using select *.

  • Thanks. I agree with you and plan to use check constraints as a compromise as I seem to be losing this battle.

    I've lost those battles before, only to have data integrity issues.  I stopped losing them after those fixes.  The first duty of a DBA is to protect the integrity of the data.  In this simple case, CHECK constraints could work because of the limited number of choices, but RI would be the better and more flexible solution.

    One exception could be state codes.  I have seen setups where each state has a number associated with it in addition to its unique postal code.  In that case the join is unnecessary; the unique postal code can serve as the key.  This assumes the work is limited to the US and Canada.

  • Those lookup tables are so small they're always going to be in memory, and the effort to add them in is trivial. Since you could grow beyond three vote states, I would avoid the BIT choice. I'd use a tinyint and assume we're not going past 255 values.
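
    Something like this, with hypothetical names:

    CREATE TABLE dbo.VoteOption (
        VoteOptionID tinyint     NOT NULL PRIMARY KEY,  -- tinyint holds 0-255, room to grow
        OptionName   varchar(20) NOT NULL
        );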

    That being said, developers should stop hard-coding things like this. It's technical debt and it's easy to forget.  A quick story: back in SQL Server 2005 or 2008, one of the service packs blew up lots of Agent schedules. The reason was that they had added an option for duration, which was stored as an enum, an application hard-coded construct. One of the developers remarked later that a) there's no such thing as an easy change in a large app, and b) this should have been stored in a table/config file and loaded, not hard coded.

  • >> We have a voting application where users can vote on Issues. In this scenario a user can cast one of three possible values for his/her vote. FOR, AGAINST, ABSTAIN <<

    CREATE TABLE Ballots
    (ballot_topic_id CHAR(5) NOT NULL
     REFERENCES Ballot_Topics (ballot_topic_id)
     ON DELETE CASCADE,
    user_id CHAR(10) NOT NULL
     REFERENCES Users (user_id)
     ON DELETE CASCADE,
    vote CHAR(1) DEFAULT 'A' NOT NULL
     CHECK (vote IN ('A', 'Y', 'N')),
    PRIMARY KEY (ballot_topic_id, user_id)  -- one vote per user per topic
    );

    Let's start off by normalizing your schema. Users are clearly entities, and entities get their own table. The ballot issues are clearly entities, and they get their own table. These two tables are referenced in the voting. The vote has three values, and I don't see any way you're going to add any more values to it.

    My rule of thumb is that if the referenced set of values is very large or dynamic, then use a reference to a table. If the values are stable and the set of values is relatively small, then use a check constraint. The advantage of the constraint over the reference is that the optimizer can get to it, and you don't have to worry about the overhead of indexing other things.

    CREATE TABLE Users
    (user_id CHAR(10) NOT NULL PRIMARY KEY,
    ..);

    CREATE TABLE Ballot_Topics
    (ballot_topic_id CHAR(5) NOT NULL PRIMARY KEY,
    topic_txt VARCHAR(500) NOT NULL,
    ..);

    >> Some of the developers believe that we don't need the VotingOptions table and that the three VotingOptions of FOR, AGAINST, ABSTAIN can be handled in code. They believe that the field [sic] TheVoteCastByTheUser in table TheVote should be a VarChar that the code sets to FOR, AGAINST or ABSTAIN (or a value of F, A, AB to represent those values). They feel the overhead of joining to the VotingOptions table is not worth the referential integrity benefit. Some of us believe that although the contents of the VotingOptions table will never change, that proper design requires the use of the VotingOptions table. <<

    This is sort of a mixed bag for me: they are correct that the voting options table is absurd. It adds overhead and gains nothing; the check constraint handles it very nicely. But they got the part about the design wrong! This design fails to give you even first normal form.

    >> Scenario #2
    The application has customers. Each of the Customers has a status that indicates if the Customer is Active, InActive or Suspended. Like the scenario above we have a static data/support table that contains the CustomerStatus that the Customer table refers to. Unlike the previous scenario from time to time the CustomerStatus could add other status values besides Active, InActive and Suspended. <<

    This is another version of the same design problem. However, nobody seems to remember that a status is a "state of being", so by its very nature it has a time element; we should have a history that shows the start and end dates of a given status. I'd recommend that you get a copy of *Developing Time-Oriented Database Applications in SQL* as a PDF download from the University of Arizona. It was written by Rick Snodgrass (he's very well known in this area of databases), and it's free.

    CREATE TABLE Customer_History
    (customer_id CHAR(10) NOT NULL
     REFERENCES Customers (customer_id),
    start_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    end_date DATE,
    CHECK (start_date <= end_date),
    cust_status CHAR(1) NOT NULL
     CHECK (cust_status IN ('A', 'I', 'S')),
    PRIMARY KEY (customer_id, start_date),
    ..);

    There are some tricks, due to Kuznetsov, for avoiding gaps in such a table, and I recommend that you create a view that looks for a NULL end date to show the current status.
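
    A minimal sketch of such a view, against the history table above:

    CREATE VIEW Customer_Current_Status
    AS
    SELECT customer_id, cust_status, start_date
    FROM Customer_History
    WHERE end_date IS NULL;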

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Wednesday, September 26, 2018 1:12 PM

    There are some tricks about avoiding gaps in such a table due to Kuznetsov, and I recommend that you create a view based on looking for a null end date to show that it is a current status.

    Since the OP is on 2017, I'd probably suggest avoiding that entirely by using Temporal Tables if maintaining history over time is important.
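
    A rough sketch of the system-versioned approach (available in SQL Server 2016 and later; the names here are hypothetical):

    CREATE TABLE dbo.Customer
    (customer_id CHAR(10) NOT NULL PRIMARY KEY,
    cust_status CHAR(1) NOT NULL,
    valid_from DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    valid_to DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (valid_from, valid_to))
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Customer_Status_History));

    -- SQL Server maintains the history automatically; point-in-time queries become:
    -- SELECT * FROM dbo.Customer FOR SYSTEM_TIME AS OF '2018-01-01';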

  • RonKyle - Wednesday, September 26, 2018 7:11 AM

    Thanks. I agree with you and plan to use check constraints as a compromise as I seem to be losing this battle.

    I've lost those battles before only to have data integrity issues.  I stopped losing them after the fixes.  The first duty of a DBA is to protect the integrity of the data. 

    +1: "First duty of a DBA is to protect the integrity of the data"

    RonKyle - Wednesday, September 26, 2018 7:11 AM

    One exception could be state codes.  I have seen setups where each state has a number associated with it in addition to its unique postal code.  That case is an unnecessary join.  This assumes the work is limited to the US and Canada.

    Until they change their 2-letter code, which happened when Newfoundland (NF) became Newfoundland and Labrador (NL). We had a *lot* of addresses and stats.
    Several million rows needed to be updated, back in the days of SQL 6.5.

    Apart from area codes, that was the end of natural keys for me.

  • Had this exact same argument with a friend on Facebook like two days ago. He honestly said a) MongoDB is amazing and b) all of this should live in the code, basically hinting that we are obsolete now and every programmer should know how to build a database or be fired.

    Man, I went to town on him pretty bad, and I have no issues with more things living in the code versus many on this forum. As it's been mentioned already, it's EXTREMELY EASY to forget where all the business logic lives when it's hard coded like that. Something could be hidden away in some app Joe Blow developed like 2 years ago, and he has since left the company. The mere fact that you have to open a solution, code the change, and recompile the application is completely silly. That, and expecting every Java programmer to be some master data architect just isn't going to go over well.

    Got to think, though. I am constantly designing apps in the data lake that have hard-coded values for initial concept stuff. I essentially have case logic in Python scripts that helps build new datasets. It's extremely common that I have to keep going back and editing that logic to factor in more and more variables. It would be so much more modular and scalable if I just referenced a table from a database, where what I need can change with a simple INSERT, UPDATE, or DELETE, versus me spending time coding it.
