Create Index performance - what am I missing?

  • Brand new database server (Dell R10, 4 x 8-core), 64-bit SQL Server Enterprise with 256 GB of RAM, attached to an EMC CX4 SAN load-balanced across dual 4 Gb fiber.

    We are just warming it up and doing some testing.

    Pulled over about 280 million rows from a DB2 database, and dumped them into a table with no indexes at all -- about 58 Gigabytes of data.

    Did a create index against three columns and it took 45 minutes. I spent some time moving files and storage areas around and got it down to about 30 minutes.

    Here's my dilemma -- I can't figure out what the bottleneck is. Here is what I know:

    - The CPU is overall doing nothing (though, to be fair, during most of the time the equivalent of one core is busy). It doesn't matter whether MAXDOP is set to 1 or left open (which yields about 64 or so threads!).

    - The I/O is negligible -- it hangs around 800-1,400 IOPS, which is probably 10-20% of what it should be able to do. Every measure agrees, whether from the Windows side or the SAN side; every path, every part is running at a tiny percent of capacity.

    Aside: To be certain of the above, I put Iometer onto the same volumes and can generate about 10,000 IOPS in 8K pages, even at about 60% write. And that's just to one SAN LUN; I have this on two SAN LUNs and one DASD.

    - No hard faults, negligible faulting in general (heck, the whole thing can be cached, the server is idle with 256G of memory).

    - No other processes, so no blocking (though when MAXDOP is not constrained it does show it blocking itself periodically).

    - Wait states are generally PAGEIOLATCH waits with a few CXPACKET, which you expect to see some of, though the actual overall I/O rate, as mentioned, is very low in comparison to the hardware.

    - Log files are on a RAID 0+1 set on local DASD, which is also near idle in terms of I/O rate.

    For comparison, I did a SELECT COUNT(*) WHERE field1 = field2, where neither field was indexed, and the first time (no cached query) it came back in 8 seconds -- so most of the table must be cached. I cleared all the caches and it took 6 minutes. I figure that sets something of a baseline for how long it should take to read the data, but since most of the rest of the work should be in memory (LOTS of it), where is the other 25 minutes going?

    Incidentally, on the above I hit 96% utilization on the disk LUN where the data is, so the select-after-clear-cache is clearly I/O-limited on that LUN. Which incidentally means my SAN-side measurements track reality, so I remain pretty darn sure it is not I/O wait causing the slow index build; I would see it in both places, and I see it in neither.

    I just can't figure out why, during the create index, nothing is really busy -- not the CPU, memory, or disk. What else is there?

    What am I missing?

  • Creating an index will usually require additional space in the MDF... could it be down to autogrowth?

    Or have you set the database to simple recovery model while you're building up the index, to make sure there are not a lot of log writes?

    I've found both of these can significantly slow things down.
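    Both points can be checked up front. A minimal sketch (database and file names are hypothetical) of verifying the recovery model and pre-growing the data file so the build never stalls on an autogrow event:

    ```sql
    -- Check the current recovery model and the file sizes/growth settings
    SELECT name, recovery_model_desc FROM sys.databases WHERE name = 'StagingDB';
    SELECT name, size/128 AS size_mb, growth, is_percent_growth
    FROM StagingDB.sys.database_files;

    -- Switch to SIMPLE and pre-grow the data file in one large step,
    -- so the index build never pauses mid-run for an autogrow
    ALTER DATABASE StagingDB SET RECOVERY SIMPLE;
    ALTER DATABASE StagingDB
    MODIFY FILE (NAME = StagingDB_Data, SIZE = 120000MB, FILEGROWTH = 4096MB);
    ```

    With instant file initialization enabled for the data file, even a large pre-grow is nearly free; the log file always has to be zeroed.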

  • McSQL (1/19/2011)


    Creating an index will usually require additional space in the mdf...could be down to autogrowth?

    Or have you set the database to simple recovery model while you're building up the index, to make sure there are not a lot of log writes?

    I've found both of these can significantly slow things down.

    Simple: yes.

    And I should have mentioned, I pre-allocated a second storage area on a second LUN for the index and placed it there.

    That's the thing -- if it were space movement, initialization, or anything I/O-related, I should see one of the I/O paths busy. They are just loafing.

    Whatever it is doing, I think one core is doing it (or the equivalent -- I realize it moves around), and it is all happening in memory, even when it has decomposed it into a ton of threads.

    This is not a killer -- I think clearly something is being single-streamed, and the server itself has lots more reserve capacity. But it is going to be an ETL machine, so it will build LOTS of indexes, and this was a very disappointing first showing.

  • Hmm, personally I would say that 30 minutes to build an index on 280 million rows is not bad performance at all, really. Was this for a clustered index?

    Are you getting similar creation times for clustered/ non-clustered indexes?

    It may have been quicker to try BCPing out to a file, ordering, and then BCPing back into a table with a clustered index using the load hint "-h ORDER(<column>)".

    If it isn't, though, the method you're using should be the fastest.

    And there are no hardware bottlenecks at all?
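    The export/order/reload suggestion might look something like this sketch (server, database, table, and column names are all hypothetical; the target table is assumed to already have its clustered index on Col1):

    ```shell
    # Export in native format, pre-sorted on the future clustered key
    # (this is the "ordering" step, pushed into the export query)
    bcp "SELECT * FROM StagingDB.dbo.BigTable ORDER BY Col1" queryout bigtable.dat -n -S MYSERVER -T

    # Reload into the clustered table. The ORDER hint tells the engine the
    # stream is already sorted so it can skip the sort; TABLOCK allows
    # minimal logging under the simple/bulk-logged recovery models.
    bcp StagingDB.dbo.BigTable_CI in bigtable.dat -n -S MYSERVER -T \
        -h "ORDER(Col1 ASC), TABLOCK" -b 1000000
    ```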

  • McSQL (1/19/2011)


    Hmm, personally I would say that 30 minutes to build an index on 280 million rows is not bad performance at all, really. Was this for a clustered index?

    .....

    And there are no hardware bottlenecks at all?

    Non-clustered.

    That last is the question -- clearly everything ultimately comes down to a hardware bottleneck; the question is which one. I feel like I am forgetting to look at something specific.

    But all the obvious suspects seem not busy -- no paging, a very new, fast server, an almost idle SAN, and a separate DASD RAID controller.

    My guess is something in it is very much single-streamed, so it can only use one core. But I put a clustered primary key on the table (unrelated to this index), hoping it would parallelize the sort (i.e., read from multiple parts of the table at once). No change, or at least nothing significant.

    Fundamentals -- it can read the entire table in 6 minutes uncached, and in seconds cached. Write speed is not a restriction -- the log, temp, and data areas are all nowhere near saturated and are heavily cached. Something it is doing as it constructs the index is just plain slow.

    PS: As to "that sounds good" -- it all depends on your hardware. This is a lot of hardware with nothing else on it yet. To me it would sound good on my laptop, but it sounds really slow on $400K of hardware. 🙂

  • Try checking file-stalls and top waits on the box to see if anything stands out;

    -- Helps determine which database files on the entire instance have the most I/O bottlenecks
    SELECT DB_NAME(fs.database_id) AS [Database Name], mf.physical_name, io_stall_read_ms, num_of_reads,
        CAST(io_stall_read_ms/(1.0 + num_of_reads) AS NUMERIC(10,1)) AS [avg_read_stall_ms], io_stall_write_ms,
        num_of_writes, CAST(io_stall_write_ms/(1.0 + num_of_writes) AS NUMERIC(10,1)) AS [avg_write_stall_ms],
        io_stall_read_ms + io_stall_write_ms AS [io_stalls], num_of_reads + num_of_writes AS [total_io],
        CAST((io_stall_read_ms + io_stall_write_ms)/(1.0 + num_of_reads + num_of_writes) AS NUMERIC(10,1))
            AS [avg_io_stall_ms]
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
    INNER JOIN sys.master_files AS mf
        ON fs.database_id = mf.database_id
        AND fs.[file_id] = mf.[file_id]
    ORDER BY avg_io_stall_ms DESC;

    *****************************

    WITH Waits AS
    (SELECT wait_type, wait_time_ms / 1000. AS wait_time_s,
        100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn
     FROM sys.dm_os_wait_stats
     WHERE wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK'
        ,'SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR','LOGMGR_QUEUE','CHECKPOINT_QUEUE'
        ,'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH','BROKER_TASK_STOP','CLR_MANUAL_EVENT'
        ,'CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE','FT_IFTS_SCHEDULER_IDLE_WAIT'
        ,'XE_DISPATCHER_WAIT','XE_DISPATCHER_JOIN'))
    SELECT W1.wait_type,
        CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
        CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
        CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
    FROM Waits AS W1
    INNER JOIN Waits AS W2
        ON W2.rn <= W1.rn
    GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
    HAVING SUM(W2.pct) - W1.pct < 99; -- percentage threshold

    -- Common significant wait types, with BOL explanations
    -- *** Network-related waits ***
    -- ASYNC_NETWORK_IO: occurs on network writes when the task is blocked behind the network.
    -- *** Locking waits ***
    -- LCK_M_IX: a task is waiting to acquire an Intent Exclusive (IX) lock.
    -- LCK_M_IU: a task is waiting to acquire an Intent Update (IU) lock.
    -- LCK_M_S: a task is waiting to acquire a Shared lock.
    -- *** I/O-related waits ***
    -- ASYNC_IO_COMPLETION: a task is waiting for I/Os to finish.
    -- IO_COMPLETION: waiting for I/O operations to complete. Generally represents non-data-page
    --   I/O; data-page I/O completion waits appear as PAGEIOLATCH_* waits.
    -- PAGEIOLATCH_SH: a task is waiting on a latch for a buffer that is in an I/O request; the
    --   latch request is in Shared mode. Long waits may indicate problems with the disk subsystem.
    -- PAGEIOLATCH_EX: same, but the latch request is in Exclusive mode.
    -- WRITELOG: waiting for a log flush to complete. Common operations that cause log flushes
    --   are checkpoints and transaction commits.
    -- PAGELATCH_EX: a task is waiting on a latch for a buffer that is NOT in an I/O request;
    --   the latch request is in Exclusive mode.
    -- BACKUPIO: a backup task is waiting for data, or for a buffer in which to store data.
    -- *** CPU-related waits ***
    -- SOS_SCHEDULER_YIELD: a task voluntarily yielded the scheduler for other tasks to execute;
    --   it is waiting for its quantum to be renewed.
    -- THREADPOOL: a task is waiting for a worker to run on. Can indicate that the max worker
    --   threads setting is too low, or that batch executions are taking unusually long, reducing
    --   the workers available to satisfy other batches.
    -- CXPACKET: waiting to synchronize the query processor exchange iterator. Consider lowering
    --   the degree of parallelism if contention on this wait type becomes a problem.

    ****************************

    Might help narrow it down a bit if you see one particular file with significant I/O stalls, or an unexpected wait type.

    Credit to Glenn Berry for the script source.

  • Is there a difference between online and offline?

    What about sort in tempdb?

  • Turn hyperthreading off and see what happens. You mentioned a quad-socket octo-core system, but 64 threads.

    Definitely look at SORT_IN_TEMPDB and ONLINE; try both ways.

    Also, you mentioned MAXDOP 0 and 1; try 4 or 8.

    Are you watching the local OS stats as well? Are you thrashing the pagefile for some reason, perhaps?

    Check the CUs for whatever SP you're on; maybe it's a bug.

    Watch raw throughput as well as IOPS; is something saturating whatever bandwidth the SAN has?

    On the SAN side, are snapshots or something else using up some performance?
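    The SORT_IN_TEMPDB / ONLINE / MAXDOP suggestions above can all be toggled on the build itself. A hedged sketch (table, column, and index names are hypothetical) to try each combination and compare timings:

    ```sql
    -- Try each option both ways and compare elapsed time / core activity
    CREATE NONCLUSTERED INDEX IX_BigTable_ThreeCol
    ON dbo.BigTable (Col1, Col2, Col3)
    WITH (SORT_IN_TEMPDB = ON,  -- intermediate sort runs go to tempdb, not the target filegroup
          ONLINE = OFF,          -- offline builds are generally faster when nothing else needs the table
          MAXDOP = 8);           -- cap parallelism below the 64-thread free-for-all
    ```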

  • Hmmm, interesting puzzle.

    Ferguson (1/18/2011)


    Pulled over about 280 million rows from a DB2 database, and dumped them into a table with no indexes at all -- about 58 Gigabytes of data.

    Did a create index against three columns and it took 45 minutes. I spent some time moving files and storage areas around and got it down to about 30 minutes.

    Here's my dilemma -- I can't figure out what the bottleneck is. Here is what I know:

    The five bottlenecks:

    CPU, memory, disk I/O, disk concurrency, and network.

    - The CPU is overall doing nothing (though, to be fair, during most of the time the equivalent of one core is busy). It doesn't matter whether MAXDOP is set to 1 or left open (which yields about 64 or so threads!).

    As mentioned above, make sure your hyperthreading is turned off. It's usually not good for a SQL Server, though there are, like everything, exceptions. Otherwise, CPU's out as the bottleneck.

    The I/O is negligible -- it hangs around 800-1,400 IOPS, which is probably 10-20% of what it should be able to do. Every measure agrees, whether from the Windows side or the SAN side; every path, every part is running at a tiny percent of capacity.

    Aside: To be certain of the above, I put Iometer onto the same volumes and can generate about 10,000 IOPS in 8K pages, even at about 60% write. And that's just to one SAN LUN; I have this on two SAN LUNs and one DASD.

    That kills Disk I/O directly, and concurrency issues.

    - No hard faults, negligible faulting in general (heck, the whole thing can be cached, the server is idle with 256G of memory).

    That kills memory.

    - No other processes so not blocking (though when MAXDOP is not constrained it does show it blocking itself periodically).

    That kills network.

    - Wait states are generally page io latch waits with a few cxpacket, which you expect to see some of, though the actual overall I/O rate as mentioned is very low in comparison to the hardware.

    So, after you've quite nicely checked all our main culprits... you're still showing Disk I/O waits. Whaaa? Hm.

    - Log files are on a 0+1 set on a local DASD raid set, which is also near idle in terms of I/O rate.

    Your waits for the log would be WRITELOG waits, instead of PAGEIOLATCH. Ain't the logfile.

    For comparison, I did a SELECT COUNT(*) WHERE field1 = field2, where neither field was indexed, and the first time (no cached query) it came back in 8 seconds -- so most of the table must be cached. I cleared all the caches and it took 6 minutes. I figure that sets something of a baseline for how long it should take to read the data, but since most of the rest of the work should be in memory (LOTS of it), where is the other 25 minutes going?

    It's going to sorting... hmmmmmm... sorting. Where's your tempdb at?

    What am I missing?

    System table storage. Sorts will go to tempdb at a certain point. Are you checking the stats on the system databases?
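    One way to check whether the sort is landing in tempdb: while the index build runs, a sketch like this (standard DMVs, no assumptions beyond SQL 2005+) shows per-session tempdb allocations; a large internal-object figure for the build's session means the sort has spilled there.

    ```sql
    -- Per-session tempdb usage; internal objects include sort runs and hash spills
    SELECT ts.session_id,
           SUM(ts.internal_objects_alloc_page_count)/128 AS internal_alloc_mb,
           SUM(ts.user_objects_alloc_page_count)/128     AS user_alloc_mb
    FROM sys.dm_db_task_space_usage AS ts
    WHERE ts.session_id > 50          -- skip system sessions
    GROUP BY ts.session_id
    ORDER BY internal_alloc_mb DESC;
    ```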


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • First thing that pops into my head is Hyperthreading (as also mentioned by someone else).

    I had a similar issue where everything was just dragging and I couldn't figure out why...turns out SQL was assigning threads to the virtual HT cores because it can't tell the difference, and those cores are much, much less powerful than a normal core.

    After you disable that at the BIOS level and reboot, try setting MAXDOP 2, 4, or 8 when you create the index and see if some of your cores spike.

  • Thanks for the insight, I need to spend more time on this. We are putting in more disk on the SAN tomorrow so we have a bunch of downtime, and when back up will find some time to do more testing.

    We had moved on to other tests, and they seem to confirm that I am getting a single thread (despite the sometimes monstrous amount of parallelism generated) that is doing all the work. And that one thread seems to be about right for one core (not one hyper-core, one whole core -- my understanding is one hyper-core is a full core if nothing is competing with it, right?).

    But the real issue is what we can do to get more activity in the parallel threads. Also, I need to investigate the wait states -- to some of the comments above, my monitoring was limited to brief looks at sysprocesses, nothing over the life of the run. I need to do that (we have tools like SQL Sentry, plus the built-in stuff). Any create index is going to have SOME I/O waits, but the SAN is definitely, positively not busy.

    More later, and thanks...

  • Does the table have a clustered index on it when you build the NC index?

  • Lynn Pettis (1/24/2011)


    Does the table have a clustered index on it when you build the NC index?

    I tried it both ways; it was marginally faster with the clustered index first. I had thought that might allow for more parallelism (or at least more effective parallelism). Not much help.

  • Craig gave a great walkthrough; thank you, Craig.

    One great point he brought up was tempdb; in particular, monitor that. I'd be very curious about the "true" physical layout of your database files; i.e., which databases (especially tempdb) are on which physical spindles, how those spindles are configured, and what _else_ is on those physical spindles.

    Also, of course, use Perfmon or similar to monitor _all_ disk activity, from the paging file and the OS through every single logical/physical disk the server has access to; might as well throw Network in as well, just in case (mirroring, network shares, replication, etc.).

    When you don't know... watch everything.
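    A quick way to answer the "where does tempdb physically live" question from inside the instance (standard catalog view, no assumptions):

    ```sql
    -- tempdb's physical file locations, sizes, and growth settings;
    -- map physical_name back to the underlying LUNs/spindles by hand
    SELECT name, physical_name, size/128 AS size_mb, growth, is_percent_growth
    FROM sys.master_files
    WHERE database_id = DB_ID('tempdb');
    ```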

  • Have you tried disabling hyperthreading yet?
