What Steps Would You Take: Inserting 200,000 records

  • Hello guys,

    I'm a novice. I need to create a stored procedure to insert 200,000 records into an existing table from a staging table in the same db. (The table has about 6 columns. It has a composite primary key, two non-clustered indexes, and one default constraint.) I've read a lot of articles online. The information seems overwhelming or complex. The data is not critical. We will likely insert the first 200,000 or fewer records, and after that the loads will be much smaller, maybe 10,000 or so additional records.

    Outside of doing an INSERT INTO table (columns) SELECT.... statement, what might I use?
    -  Should I use BULK INSERT? (This, I understand, is only for loading flat files.)
    -  Should I write a query that breaks the data into batches of 5,000, 10,000, or 50,000 records?
    What are some basic things that I can do?

    Thanks!

  • Use INSERT...SELECT.  You can do it all in one go, or you can break it into batches if you're worried about blowing your transaction log or locking too many rows or grabbing too much memory.
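
    For example, a single-statement version might look something like this (the table and column names are just placeholders for your own):

      INSERT INTO dbo.target (col1, col2, col3)
      SELECT col1, col2, col3
      FROM dbo.source;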

    John

  • Thanks!  I found this:

    DECLARE @rc INT = 1;

    WHILE @rc > 0
    BEGIN
      BEGIN TRANSACTION;

      INSERT dbo.target(cols)
        SELECT TOP (5000) cols
        FROM dbo.source AS s
        WHERE NOT EXISTS
        (
          SELECT 1 FROM dbo.target AS t
          WHERE t.key = s.key
        )
        ORDER BY clustering_key;

      SET @rc = @@ROWCOUNT;
      COMMIT TRANSACTION;
    END

     

    Unless the rows are insanely wide, a 200,000 row initial insert is pretty much a cakewalk.  And it can be done using "Minimal Logging", which not only makes it easy on the log file but also nearly doubles the speed.

    1. Take a log file backup.
    2.  Set the database to "Bulk Logged" if it's in the FULL recovery model.  This will NOT break the log file chain.
    3.  Disable (NOT drop) the non-clustered indexes on the target table.  Leave the clustered index enabled.
    4.  Insert the data from your source to target using WITH (TABLOCK) on the INSERT, an ORDER BY that matches the clustered index order, and an OPTION (RECOMPILE) to help SQL Server recognize that this INSERT/SELECT can be minimally logged.  It also keeps this one-time query from being cached.  (See the sketch after this list.)
    5.  Rebuild the non-clustered indexes to re-enable them.  This will also be "Minimally Logged" and nasty fast.  There is no need to rebuild the clustered index because of the way we did this.
    6.  Return to the FULL recovery model.
    7.  Take a log file backup to end the period where you have a bulk-logged operation in your log files.  Such operations will prevent you from doing a "Point-in-time" restore to the middle of such a transaction log backup so it's important to keep such a period to a minimum.
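
    Put together, steps 2 through 6 might look something like this rough sketch (the database, index, and column names are just placeholders, and it assumes the db starts in the FULL recovery model; the DBA team would take the log backups in steps 1 and 7):

      -- Step 2: switch to BULK_LOGGED (does not break the log chain)
      ALTER DATABASE YourDb SET RECOVERY BULK_LOGGED;

      -- Step 3: disable the non-clustered indexes; leave the clustered index alone
      ALTER INDEX IX_target_ncix1 ON dbo.target DISABLE;
      ALTER INDEX IX_target_ncix2 ON dbo.target DISABLE;

      -- Step 4: TABLOCK plus an ORDER BY in clustered key order for Minimal Logging
      INSERT INTO dbo.target WITH (TABLOCK) (col1, col2, col3)
      SELECT col1, col2, col3
      FROM dbo.source
      ORDER BY col1, col2   -- same order as the clustered index key
      OPTION (RECOMPILE);

      -- Step 5: rebuilding the non-clustered indexes also re-enables them
      ALTER INDEX IX_target_ncix1 ON dbo.target REBUILD;
      ALTER INDEX IX_target_ncix2 ON dbo.target REBUILD;

      -- Step 6: return to the FULL recovery model
      ALTER DATABASE YourDb SET RECOVERY FULL;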

    Unless you have some really wide rows or other weirdness going on, the whole thing should take less than 30 seconds... probably a lot less.  The last time I did something like this (about a week ago) was to import two 200,000 row files (about 50 columns wide) into staging tables with the clustered index in place, and I did an EXCEPT to compare the two tables and write the differences out to a third table.  Including the call to find the latest 2 files to import, the whole thing took 8 seconds to run.  I've got a pretty fast machine at work, so I'm thinking that you loading one table and rebuilding the NCIs might take about 4 times longer (about 30 seconds).  The thing that will take the longest is the ORDER BY if a sort is actually needed.  If not, the ORDER BY won't even come into play and won't appear on the execution plan.

    For more information on this and other data loading techniques, please see the article at the following URL.  Except for the fact that you no longer need to use Trace Flag 610 for certain things, this 2009 document is still the bible for extremely fast loads.  It's a long read but well worth it.  Keep it handy.
    https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd425070(v=sql.100)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hi Jeff,

    Thanks for the information. I shared it with my team lead.

    She said:

    We don’t have the ability to take a log backup/set db to bulk logged.  It would have to be escalated to the dba team, and would have to be approved.  Due to the size of the insert, I don’t see why this should be needed. 
    If you are inserting this via a stored proc (which I expect), you will not have the ability to disable indexes/rebuild them, etc

  • 200K rows should be trivial, use "WITH (TABLOCKX)"  and no one will notice (blame the network guys) 
    😎
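
    For what it's worth, that hint just goes on the target table in the INSERT, something like this (names are placeholders):

      -- TABLOCKX asks for an exclusive lock on the whole target table for the duration of the insert
      INSERT INTO dbo.target WITH (TABLOCKX) (col1, col2, col3)
      SELECT col1, col2, col3
      FROM dbo.source;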

  • Jacob Pressures - Monday, October 8, 2018 9:10 AM

    Hi Jeff,

    Thanks for the  information. I shared with my team lead.

    She said:

    We don’t have the ability to take a log backup/set db to bulk logged.  It would have to be escalated to the dba team, and would have to be approved.  Due to the size of the insert, I don’t see why this should be needed. 
    If you are inserting this via a stored proc (which I expect), you will not have the ability to disable indexes/rebuild them, etc

    NP.  Hopefully, she understands that I was talking only about the initial load which, as others have stated, isn't really that big at all unless you have wide rows or a ton of LOBs.  Just take out the steps I included for Minimal Logging and taking the backups (those were just to minimize the minimal logging period) and go for it.
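
    Something like this inside your proc would probably do it (the table, column, and procedure names are just placeholders for your own):

      CREATE PROCEDURE dbo.LoadTargetFromStaging
      AS
      BEGIN
        SET NOCOUNT ON;

        -- One set-based insert; skips any rows already present in the target
        INSERT INTO dbo.target (col1, col2, col3)
        SELECT s.col1, s.col2, s.col3
        FROM dbo.source AS s
        WHERE NOT EXISTS
        (
          SELECT 1
          FROM dbo.target AS t
          WHERE t.key_col = s.key_col
        );
      END;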

    Keep in mind, also, that my suggestion was based on concerns on your end about things like splitting up the load to keep from blowing out the log file, etc.  The method I proposed would have only recorded new page allocations, which would be tiny in comparison to the fully logged method no matter how you split it up.

    Shifting gears a bit, I'm a little surprised in this day and age of "DevOps" that she's balking at the suggestion of getting the DBA Team involved.  They should always be involved for new table initial loads just to keep everyone's keester out of the fire if something goes wrong.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions
