Faster ETL import process

  • I am currently testing a scenario in which I import log files into a staging table, then into a fact table, which is then used to populate an Essbase cube for analysis.

    I use a package to import the raw text files into the staging area. This runs in about 3 minutes and imports about 50 million rows.

    The structure of the tables and the SQL used to load the fact table is as follows:

    --Staging Table

    CREATE TABLE [stg_LogFiles] (

    [LineRead] [varchar] (80) NULL ,

    [DateLoaded] [datetime] NULL

    )

    -- Fact Table

    CREATE TABLE [fct_Logs] (

    [LogDate] [datetime] NULL ,

    [Application] [int] NULL ,

    [Database] [int] NULL ,

    [User] [int] NULL ,

    [Retrieval] [int] NULL

    ) ON [PRIMARY]

    --Stored procedure to populate fact table

    CREATE PROCEDURE sp_LoadLogFact

    AS

    CREATE TABLE #F1 (

    [Date] VARCHAR (80) ,

    [Application] [varchar] (25),

    [Database] [varchar] (25),

    [User] [varchar] (25)

    )

    INSERT INTO #F1([Date],[Application],[Database],[User])

    (

    SELECT

    SUBSTRING(LineRead,2,24)

    ,SUBSTRING(

    LineRead,

    CHARINDEX('/',LineRead)+1,

    CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)-1 - CHARINDEX('/',LineRead))

    ,SUBSTRING(

    LineRead,

    CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)+1,

    CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)+1)-1

    - CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1))

    ,SUBSTRING(

    LineRead,

    CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)+1) +1,

    CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)+1)+1)- 1 -

    CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1) +1))

    FROM dbo.stg_LogFiles

    WHERE CHARINDEX('[',LineRead) = 1

    AND NOT SUBSTRING(LineRead,

    CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)

    ,2) = '//'

    AND SUBSTRING(LineRead,2,4) IN ('mon ','tue ','wed ','thu ','fri ','sat ','sun ') -- ONLY SELECT records where user id is present

    )

    INSERT INTO [ESSBASE_TEST_DW].[dbo].[fct_Logs]([LogDate], [Application], [Database], [User], [Retrieval])

    (select

    convert(datetime,substring([Date],5,7)+ right([Date],4) + substring([Date],11,9),108),

    A.ApplicationID,

    A.DatabaseID,

    U.UserID,

    1

    FROM #f1 B

    LEFT JOIN dbo.dimApplication A ON A.Application = B.Application

    LEFT JOIN dbo.dimUser U ON U.UserName = B.[User]

    )

    DROP TABLE #F1

    Currently this process takes about 26 minutes to run, with the bulk of the time taken by the stored procedure. I would just like some ideas on how to improve the speed of this process. There are no indexes on the tables, and I am not sure the SQL used in the stored procedure is the most efficient.

    Any information would be very much appreciated.

  • I could say at a first glance that you are using a #temp table that could be avoided using a CTE, but there's much more that could be involved in tuning this process.

    Take a look at this article: you'll find information on how to ask a performance-related question effectively. I'm sure that if you provide all the information as explained in the article, lots of people here will be willing to help you.

    -- Gianluca Sartori
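    A minimal sketch of the CTE approach suggested above, using the table and column names from the posted DDL. Only the [Application] parse is shown; the [Database] and [User] expressions from the original procedure would slot in the same way.

    ```sql
    -- Sketch only: the #F1 temp table is replaced by a CTE so the parsed
    -- rows flow into the fact table in a single INSERT...SELECT.
    WITH ParsedLogs AS (
        SELECT
            SUBSTRING(LineRead, 2, 24) AS [Date],
            SUBSTRING(LineRead,
                      CHARINDEX('/', LineRead) + 1,
                      CHARINDEX('/', LineRead, CHARINDEX('/', LineRead) + 1)
                          - CHARINDEX('/', LineRead) - 1) AS [Application]
            -- [Database] and [User] would be parsed with the same pattern
        FROM dbo.stg_LogFiles
        WHERE CHARINDEX('[', LineRead) = 1
    )
    INSERT INTO dbo.fct_Logs ([LogDate], [Application], [Retrieval])
    SELECT
        CONVERT(datetime,
                SUBSTRING(B.[Date], 5, 7) + RIGHT(B.[Date], 4)
                    + SUBSTRING(B.[Date], 11, 9), 108),
        A.ApplicationID,
        1
    FROM ParsedLogs B
    LEFT JOIN dbo.dimApplication A ON A.Application = B.Application;
    ```

    Besides removing one full write and read of the 50-million-row intermediate result, this lets the optimizer consider the parse and the join as a single plan.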


  • You should really have some indexes on the temp table, perhaps on [User] and [Application].

    Can you post the actual execution plan for the stored proc?
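    For example (index names are illustrative), after #F1 is populated and before the join into the fact table:

    ```sql
    -- Hypothetical index names; create these once #F1 has been loaded,
    -- so the joins against dimApplication and dimUser can seek rather than scan.
    CREATE NONCLUSTERED INDEX IX_F1_Application ON #F1 ([Application]);
    CREATE NONCLUSTERED INDEX IX_F1_User ON #F1 ([User]);
    ```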

  • I have included an RTF file showing a graphical representation of the execution plan.

    Regards

  • Is this SQL 2000 or SQL 2005 ?

    Please post the actual execution plan as either a .sqlplan file or an Excel file.

    See this article for a detailed description of how to do it.

    /SG

  • The first thing you should determine is how long each statement is taking.

    If it is the second step (the INSERT INTO [ESSBASE_TEST_DW].[dbo].[fct_Logs]) that is taking too long to run, there is a good chance that creating indexes on dbo.dimApplication.Application and dbo.dimUser.UserName will speed it up, because those columns are used in the joins.
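    As a sketch (index names are illustrative; the key columns are taken from the join predicates in the posted procedure):

    ```sql
    -- Support the LEFT JOINs in the fact load; the join predicates are
    -- A.Application = B.Application and U.UserName = B.[User].
    CREATE NONCLUSTERED INDEX IX_dimApplication_Application
        ON dbo.dimApplication ([Application]);
    CREATE NONCLUSTERED INDEX IX_dimUser_UserName
        ON dbo.dimUser ([UserName]);
    ```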

  • I am wondering why you are using a staging table in the first place. The logic you are using looks simple enough to implement in SSIS with a combination of lookup and derived column transformations.

    T-SQL has never been a particularly good choice for string manipulation, so that won't be helping performance.

    Also, by avoiding the staging database, your I/O load should drop quite significantly.

  • Apologies for the double posting of this topic and any other confusion. Initially I was testing this in a SQL 2000 environment, which is why there is an original post in the SQL 2000 forum. However, due to maintenance I had to switch to our 2005 environment, which explains the post here.

    Another member of the team actually created an SSIS package, which greatly increased performance. However, for my own knowledge I would like to learn from the execution plan how performance could be increased, as I am not that familiar with them.

    I have attached a zipped copy of the SQL 2000 execution plan. Many thanks.

  • Using an SSIS package definitely seems like the best way to go.

    The execution plan you posted is unfortunately not an actual plan created with the method described in the article; it is missing information about actual row counts.

    The only thing I can see from the posted plan is that there is a clustered index scan of dimApplication. Unless that table is very small you could probably improve performance by adding a non-clustered index on the following columns: Application, ApplicationID, DatabaseID.

    You should also avoid creating the temp table. It will cause tons of unnecessary IO.

    Just go directly from dbo.stg_LogFiles to the destination using a big INSERT SELECT.

    But still, SSIS with a bulk insert directly into the destination is probably the most efficient way to do it.

    /SG
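    A sketch of the covering index suggested above, using SQL 2005 syntax (the index name is illustrative). Putting ApplicationID and DatabaseID in an INCLUDE clause keeps them in the index leaf, so the clustered index scan of dimApplication can become a seek with no key lookups:

    ```sql
    -- Covers the join on Application and returns ApplicationID and
    -- DatabaseID without touching the clustered index.
    CREATE NONCLUSTERED INDEX IX_dimApplication_App_Incl
        ON dbo.dimApplication ([Application])
        INCLUDE ([ApplicationID], [DatabaseID]);
    ```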

  • Sorry about the post; I am doing a thousand things at once. I have reposted the execution plan with, hopefully, the right information. Excuse the mistakes, as I haven't really used this before.

  • Hey, thanks guys for your help. I actually decided to look into the execution plan myself and try to work something out. I have managed to get it down from 24-plus minutes to 7 minutes 30 seconds. This was achieved by getting rid of the temp table and adding nonclustered indexes on the application and user tables, as you suggested.

    I just really wanted to know if there is anything I could do to speed it up even more. If not, then 7 and a half minutes is fine for the operation.
