Where Do Staging Tables Belong

  • I'm looking for guidelines and ideas on where staging tables should be placed. We have an application that needs about 10 staging tables for data to be stored from Excel and text files. From there stored procedures would be called in SSIS packages to work with the data in the staging tables and do modifications to the "real", "permanent" tables in the database.

    My question about all of this is, "Where should these staging tables be placed?" The tables will be loaded and then truncated once the data has been used to update the real tables. The database is already 1.6 TB in size. I don't yet know how much data would be flowing into and out of these staging tables and I would like to avoid the dead weight of free space on the database from these staging tables and the extra activity in the transaction log backups from the Inserts to the staging tables. This activity is also going to increase the size of my daily differential backup.

    I have a database on the same SQL instance that I could put these staging tables in that is only 48 MB in size right now. Adding a few GB of size to this database would not be a big deal, but if I put the staging tables in this database then I am making that database be a necessary part of maintaining the primary database. Is that something I should consider doing anyway?

    So, the question is, "Do I put the staging tables in the main database or in a separate database?" Is there a best practice and why is that the best practice?

  • The answer depends upon what resources you can offer, and what risks you are willing to take. I would prefer to create an ETL server on its own OS (with SSIS setup upon it or a 3rd box), because I do not want to comingle ETL resource demands with my system's OLTP resource demands (I instead want to separately measure and monitor each of their demands). If your resource constraint is SQL Server licenses, put the tables in a separate database, preferably with its database files residing on a completely separate IO subsystem. To minimize risk, load testing of various setups should be done. If you cannot load test, did a vendor write the application? If a vendor wrote the application, seek their guidance (I would still test, because my systems will never match a vendor's systems).

  • lmarkum (12/28/2015)


    So, the question is, "Do I put the staging tables in the main database or in a separate database?" Is there a best practice and why is that the best practice?

    IMHO, the answer is always "in a separate database". Staging tables should never have FKs so no problem there. What may be a problem is that your main database is likely in the FULL Recovery Model, which usually means that there's no chance of doing anything in a minimally logged fashion. If you have a separate database for your staging tables, then you can set that database to the SIMPLE Recovery Model and do a shedload of minimally logged activities including the initial load of the staging tables. You also don't have to backup such a database even if you have code in place on that database. The code should be kept in a different filegroup/file and should only be backed up if a change is detected in the code. You should never have to backup the tables because your code should check to see if they exist and, if they don't, it should create them using one method or another. Ostensibly, if something goes absolutely haywire in the staging database, you should have source control code that will allow you to simply drop the database and replay the code to rebuild it and the code it contains. Having the staging tables in a separate database can also help maintenance because they won't ever need to be considered for index or stats updates.

    You can still keep with a 2 part naming convention back to the main database by using synonyms or passthrough views.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • SoHelpMeCodd (12/28/2015)


    The answer depends upon what resources you can offer, and what risks you are willing to take. I would prefer to create an ETL server on its own OS (with SSIS setup upon it or a 3rd box), because I do not want to comingle ETL resource demands with my system's OLTP resource demands (I instead want to separately measure and monitor each of their demands). If your resource constraint is SQL Server licenses, put the tables in a separate database, preferably with its database files residing on a completely separate IO subsystem. To minimize risk, load testing of various setups should be done. If you cannot load test, did a vendor write the application? If a vendor wrote the application, seek their guidance (I would still test, because my systems will never match a vendor's systems).

    I could potentially do this using SSIS on a different machine. There is already SSIS installed on the machine that hosts the primary database and there is an SSIS package that is already doing some of this work. I do understand your perspective about not using resources on the OLTP instance for SSIS. I may consider moving things around so I could have SSIS on a separate machine.

    This is an in-house app that we can certainly do load testing on.

  • Thanks for the feedback Jeff. As you noted, my primary database is in the FULL recovery mode. The fact of all this activity being logged in the staging tables is one of the reasons I didn't want to do this in that primary database. What you are saying about using a staging database makes sense.

  • +1 for separate staging-database

    It is true that once you make the decision to have tables your system depends on in a second database you will be increasing your system's footprint. The new database has to be considered for things like backup schedules, code versioning and deployments, instance and database security, disaster recovery and high availability (HA), and lots of other areas. In many ways these areas bring a fixed amount of overhead to the equation so be aware lf the additional overhead being introduced by this effort. I find the setup very necessary in most of my OLTP databases to the point where a staging database on the same instance is all but standard equipment now. Note that I do not condone any ETL processes issuing DDL statements except for the occasional index recreation but I like to do that dynamically in case the index def is changed.

    From an HA-perspective, if you are using AlwaysOn Availability Groups adding this staging database to the same Group as the OLTP database makes HA simple and fast however it requires FULL Recovery on the staging database and so you give up the potential for T-log and backup benefits in some areas. The point I am trying to make is that there are good things about having a staging database next to your OLTP databases but you need to be disciplined in treating this new database as a part of the system on the same level as your OLTP database. Sounds like you have thought through a lot of the benefits of certain configurations so as long as you apply that same approach to all the system requirements you will be fine.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I'll secound Jeff's emotion that ETL staging tables should be contained in a separate dedicated database. Not only would it compete for resources with the OLTP database, but it would be obscene to include all that scratch data in the transaction log and daily backups.

    You may want to consider staging your data on a separate instance, on a separate machine. For example, SQL Server 2014 Express Edition is free and can be used without restrictions in a production environment. Despite the following limitations, and even if you had it running on an old server box, even an old desktop PC, it might still process faster since it would be a dedicated environment. You could even have multiple such staging instances processing in tandem.

    - Limited to lesser of 1 Socket or 4 cores

    - Max Memory per instance: 1 GB

    - Max Database size: 10GB (data file size only)

    - Max instance per server: 50

    - No limit on remote connections

    That assumes we're talking about data that gets routinely truncated. However, if your staging environment is more of an ODS (operational data store) for retaining history, then you'll want to keep it on a single Standard or Enterprise edition instance.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • lmarkum (12/28/2015)


    SoHelpMeCodd (12/28/2015)


    I do understand your perspective about not using resources on the OLTP instance for SSIS. I may consider moving things around so I could have SSIS on a separate machine.

    It DOES sometimes put you between a rock and a hard spot. You want to unfetter OLTP instances as much as possible but you do have to be care with licensing. Installing SSIS on a separate machine isn't free. It is treated as a separate instance with separate licensing.

    To be honest, I haven't had the need for SSIS for any of my ETL stuff. I've always been able to do it all without. It does take a different bit of knowledge, though.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • We have staging databases on separate servers, which are considerably less powerful than the servers the live databases use.

    Data is transferred by replication, SSIS package or a bespoke process (don't ask - that way lies madness).

    The staging databases are then used to supply the data warehouse databases and provide data for various reports.

    Partitioning is used to keep the data in staging to the last few days - basically long enough for any issues within the data warehouse to come to light.

    Simple recovery on the staging databases - we're never going to try and recover data from backups for those databases. The retention period on the live databases is longer than that of staging, so we can re-run any extracts to staging if we have to.

    Steve Hall
    Linkedin
    Blog Site

  • I'm almost done with a new ETL process (provided the additions stop long enough) that's pretty involved and it has its own database for this very reason. All the procedures, functions and very limited master data from OLTP databases is stored locally, as is the data set currently being processed. The permanent data that results from it is stored in the OLTP databases.

    The ETL is complex and produces some large transaction logs that I don't really want or need. A PIT recovery is never going to be required because the current data set can be rebuilt at any time. The simple recovery model was the way to go.

    Oh, and the whole thing is done in T-SQL. 😉

  • Also and not to be a contrarian to anything posted thus far... just explaining a different method... we don't do DR for staging databases and we don't even bother backing them up mostly because all of the data is temporary in nature and can easily be reimported (the original files are backed up on our source systems especially since we have to keep them for X number of years). We have scripts that will regenerate the staging databases in seconds and the main databases can, in fact, operate without the staging databases being present, as they should. In those cases where we do need to keep certain logging, that either gets logged in the main database or it gets logged in the same place as the source files.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Definitely helpful feedback and it assures me that I was on the right path I just needed to confirm that if possible.

    As for the use of SSIS, I encouraged a Junior Developer to use SSIS on this project for a couple of reasons. First, my businews does a lot with text files and Excel files. That is a big way we consume data from our customers. SSIS was designed to work with this type of data. Second, the team currently has a habit of creating a VB or C# app to handle this data consumption. Thus there is a proliferation of ltitle apps. Then they use Windows task scheduler to fire the app or occasionally it is a stand alone app that someone in the Accounts or Data Team runs manually as needed.

    These apps all manipulate data in the database and because I am not a VB or C# developer I am then dependent on others when these processes fail or have some sort of issue. So, I wanted to bring data manipulation inside SQL Server where I could manage it better.

  • lmarkum (12/29/2015)


    Definitely helpful feedback and it assures me that I was on the right path I just needed to confirm that if possible.

    As for the use of SSIS, I encouraged a Junior Developer to use SSIS on this project for a couple of reasons. First, my businews does a lot with text files and Excel files. That is a big way we consume data from our customers. SSIS was designed to work with this type of data. Second, the team currently has a habit of creating a VB or C# app to handle this data consumption. Thus there is a proliferation of ltitle apps. Then they use Windows task scheduler to fire the app or occasionally it is a stand alone app that someone in the Accounts or Data Team runs manually as needed.

    These apps all manipulate data in the database and because I am not a VB or C# developer I am then dependent on others when these processes fail or have some sort of issue. So, I wanted to bring data manipulation inside SQL Server where I could manage it better.

    We do the same... lots and lots of text and Excel files. And I thoroughly agree with moving the data manipulation inside of SQL Server for the very same reasons you state. I just don't use SSIS to do any of it. Between BULK INSERT, BCP, the ACE drivers, and a couple of other tricks, I've been able to avoid the pain that I call SSIS. Such avoidance isn't for everyone though.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I have had people scoff at running staging databases in FULL, taking backups and including them in Availability Groups and have had people scoff at running them in SIMPLE, not taking backups and having no automated HA for them. I have done both and variations in between. The bottom line is we must meet the customer's Recovery Point and Recovery Time Objectives and those concerns will usually trump developer-preference. There are going to be tradeoffs no matter which configuration is selected.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (12/29/2015)


    I have had people scoff at running staging databases in FULL, taking backups and including them in Availability Groups and have had people scoff at running them in SIMPLE, not taking backups and having no automated HA for them. I have done both and variations in between. The bottom line is we must meet the customer's Recovery Point and Recovery Time Objectives and those concerns will usually trump developer-preference. There are going to be tradeoffs no matter which configuration is selected.

    I agree.

    ... but, sometimes you might have to ask the customer why they want those recovery time and points for a staging database. Make it clear to them that you're only asking so you can deliver the best solution for them.



    Alvin Ramard
    Memphis PASS Chapter[/url]

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help[/url]

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply