Get over 500,000 records into a table

  • Hello Everyone, hope everyone is well and staying safe!

    I have an issue that I've been working on all week.  I hope someone can put a different perspective on my process to get it to succeed.  I currently have my blinders on and cannot think of anything else to try.

    I have the following scenario: a 360 GB database and a query with 3 inner joins, where one of the joined tables is 206 GB in size.  That table also contains a varbinary(max) column that holds a blob.  The query returns 583,393 rows, which need to be inserted into a table within the same database.

    I have tried the following:

    - Perform a SELECT INTO

    - Tried the transfer with an SSIS Data Flow task using Fast Load (adjusting the "Rows per batch" and "Maximum insert commit size" settings)

    - Tried the Export Wizard

    - requested extra disk space, up to a TB

    - requested an additional 64GB of memory, for a total of 96GB

    - set SQL Server's max server memory to 92GB.

    Unfortunately, I cannot get it to complete.  I get to approximately 396,000 rows inserted into the table and then it fails with a memory error and the messages below.

    Information 0x4004800c: Data Flow Task 1: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 0 buffers were considered and 0 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.

    (SQL Server Import and Export Wizard)

    Information 0x4004800f: Data Flow Task 1: Buffer manager allocated 3 megabyte(s) in 1 physical buffer(s).

    (SQL Server Import and Export Wizard)

    Information 0x40048010: Data Flow Task 1: Component "Source - Query" (1) owns 3 megabyte(s) physical buffer.

    (SQL Server Import and Export Wizard)

    Error 0x80070070: Data Flow Task 1: (SQL Server Import and Export Wizard)

    Error 0xc0208266: Data Flow Task 1: (SQL Server Import and Export Wizard)

    Error 0xc0208265: Data Flow Task 1: Failed to retrieve long data for column "FileData".

    (SQL Server Import and Export Wizard)

    Error 0xc020901c: Data Flow Task 1: There was an error with Source - Query.Outputs[OLE DB Source Output].Columns[FileData] on Source - Query.Outputs[OLE DB Source Output]. The column status returned was: "DBSTATUS_UNAVAILABLE".

    (SQL Server Import and Export Wizard)

    Error 0xc0209029: Data Flow Task 1: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The "Source - Query.Outputs[OLE DB Source Output].Columns[FileData]" failed because error code 0xC0209071 occurred, and the error row disposition on "Source - Query.Outputs[OLE DB Source Output].Columns[FileData]" specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.

    (SQL Server Import and Export Wizard)

    Error 0xc0047038: Data Flow Task 1: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED. The PrimeOutput method on Source - Query returned error code 0xC0209029. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.

    (SQL Server Import and Export Wizard)

    Any advice and input would be greatly appreciated.  Thanks in advance, Rob.

  • don't use SSIS for it - if the database is on the same server, just do a straight insert into it - if it is on a different server, then use PowerShell or C# with a bulk insert object (PowerShell example here: https://devblogs.microsoft.com/scripting/use-powershell-to-copy-a-table-between-two-sql-server-instances/).

     

    SSIS when dealing with LOBS will be very very slow and will always allocate swap and memory to deal with them in a way that is not healthy for any system.

  • Thanks Frederico for your response.  I did try "SELECT INTO" and "INSERT INTO" earlier, but they never finished within a 5-hour window, so I finally killed the query.  I hadn't tried it since the server team allocated more disk space and memory, so I just ran the insert and it has been executing for about 25 minutes as I type this message.  Hopefully it will be successful.

  • I have to ask... why are you duplicating more than a half million rows from a 206GB table into the SAME database???  If we knew that, we might be able to make a better suggestion than doing the duplication.

    That notwithstanding, and according to the error you got, it DID make it more than half way.  Sooooooooo.... "Divide'n'Conquer".  Don't do it with a single query.  Do it in two or three.

    Also, what recovery model are you using during this evolution?  And, does the target need a clustered index?  If the database is in the FULL Recovery Model, try taking a log backup, shift the database to the BULK LOGGED Recovery Model, do the copy, shift back to the FULL Recovery Model, and take another log backup to minimize the period where you can't do a Point-in-Time restore.  That might allow you to do it all in a single copy thanks to Minimal Logging.
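
    In rough T-SQL, that sequence would look something like this (just a sketch - the database name and backup paths are placeholders, not your actual ones):

    BACKUP LOG [YourDb] TO DISK = N'D:\Backups\YourDb_log_before.trn';  -- log backup before the switch
    ALTER DATABASE [YourDb] SET RECOVERY BULK_LOGGED;
    -- ... do the minimally logged copy here ...
    ALTER DATABASE [YourDb] SET RECOVERY FULL;
    BACKUP LOG [YourDb] TO DISK = N'D:\Backups\YourDb_log_after.trn';   -- log backup to close the window where Point-in-Time restore isn't possible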

    If you come back with whether or not the new table needs a Clustered Index, I can show you a trick to populate the table with Minimal Logging even though the Clustered Index is already in place.  It would also be handy if you posted the CREATE TABLE (including all indexes and constraints) for the original source table as well as the result of a "SAMPLED" run of sys.dm_db_index_physical_stats() for the original table.
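
    For reference, the "SAMPLED" run would be along these lines (the table name is a placeholder for your actual source table):

    SELECT *
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.YourSourceTable'), NULL, NULL, 'SAMPLED');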

    Right now, email notifications as to responses on posts like this one are broken and I don't check this site for responses but once every 2 to 6 hours, so be patient.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hello Jeff, well this is actually a migration project of a vendor database; we are porting the data to a new software product.  They want this data in the table below.  I was only following the instructions from the PM, who I don't think had a clue how much data the database contained, but now he does.  This was supposed to be done last weekend, but I'm still fighting it because they said it was just a SELECT INTO query.

    I have backed up the database because PROD resides on a cluster, so we could not perform all the disk and memory allocations without an impact.  I restored the database on one of our DEV servers, which is where I am doing all the work.  I don't need any logging, so the database is in the SIMPLE recovery model; all I need is to get the data described in the initial post into the table for the vendor.  If you are looking for the clustered indexes on the tables from which the data is being selected, I would need to pull those from 4 separate tables.  Below is the table structure into which the data is going.

    CREATE TABLE [dbo].[UserDocuments](
    [fk_SysId] [INT] NOT NULL,
    [UpdatedBy] [NVARCHAR](101) NOT NULL,
    [DocumentName] [NVARCHAR](100) NOT NULL,
    [Keyword] [NVARCHAR](200) NOT NULL,
    [FileName] [NVARCHAR](100) NOT NULL,
    [FileData] [VARBINARY](MAX) NOT NULL,
    [DateCreated] [DATETIME2](7) NOT NULL,
    [DocumentId] [INT] NOT NULL
    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
    GO

    I would be grateful if you could explain how to break it up, or any other way to get this accomplished.

  • I would try doing the insert in batches based on the primary key or another integer.  Create a while loop based on the key field and insert a smaller set of data within the while loop.  I've done this in the past for large inserts and it works well.  You can probably find an example on this site or another site.  Hope this helps!
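
    A rough sketch of that pattern (the table names, key column, and batch size below are placeholders, not your actual objects - plug in your own query and columns):

    DECLARE @BatchSize INT = 50000;
    DECLARE @MinId INT, @MaxId INT;
    SELECT @MinId = MIN(DocumentId), @MaxId = MAX(DocumentId) FROM dbo.SourceTable;
    WHILE @MinId <= @MaxId
    BEGIN
        -- insert only the rows whose key falls in the current range
        INSERT INTO dbo.TargetTable (DocumentId, FileData)
        SELECT s.DocumentId, s.FileData
        FROM dbo.SourceTable s
        WHERE s.DocumentId BETWEEN @MinId AND @MinId + @BatchSize - 1;

        SET @MinId = @MinId + @BatchSize;  -- move to the next key range
    END;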

  • --First, I'd strongly urge you to separate the binary data from the main table, which you can do using sp_tableoption.

    --Page-compressing the main table will save considerable space, although it may be somewhat slower to load.

    --Really all those strings should be encoded but I don't think you really want to do that at this point.

    CREATE TABLE [dbo].[UserDocuments](
    [fk_SysId] [INT] NOT NULL,
    [UpdatedBy] [NVARCHAR](101) NOT NULL,
    [DocumentName] [NVARCHAR](100) NOT NULL,
    [Keyword] [NVARCHAR](200) NOT NULL,
    [FileName] [NVARCHAR](100) NOT NULL,
    [FileData] [VARBINARY](MAX) NOT NULL,
    [DateCreated] [DATETIME2](7) NOT NULL,
    [DocumentId] [INT] NOT NULL
    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
    GO
    EXEC sys.sp_tableoption 'dbo.UserDocuments', 'large value types out of row', 1 --<<--
    --if fk_sysId by itself is not unique, add DocumentId OR DateCreated.
    CREATE UNIQUE CLUSTERED INDEX [UserDocuments__CL]
    ON dbo.UserDocuments ( fk_SysId )
    WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR = 99, SORT_IN_TEMPDB = ON )
    ON [PRIMARY];

    --Second, PRE-ALLOCATE enough space in the log file to handle the full load(s). This will drastically speed up the load vs. the log having to dynamically extend.
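
    --For example (a sketch only; the logical log file name and the sizes are placeholders - check sys.database_files for the real name):
    ALTER DATABASE [YourDb]
    MODIFY FILE ( NAME = N'YourDb_log', SIZE = 200GB, FILEGROWTH = 8GB );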

    --Finally, I (we?) would like to see the actual query that loads the table. Assuming the query can write the rows as they are processed, I'm not sure why SQL is running out of memory trying to do the load. Looking at the query might help determine this. Is there something in the query that forces SQL to process everything before it starts to load the new table?

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Here is the query....provided by the vendor.

    SELECT tblCon.fk_SysId,
    CONCAT(tblPasswd.FName, ' ', tblPasswd.LName) AS UpdatedBy,
    tblDoc.DocumentName,
    tblDoc.Keyword,
    tblDoc.FileName,
    tblDoc.FileData,
    tblDoc.DateCreated,
    tblDoc.DocumentId
    FROM tblCon
    INNER JOIN tblConDoc ON tblCon.ContextId = tblConDoc.fk_ContextId
    INNER JOIN tblDoc ON tblConDoc.fk_DocumentId = tblDoc.DocumentId
    INNER JOIN tblPasswd ON tblDoc.LastUpdatedBy = tblPasswd.user_id;

  • karen.ferrara wrote:

    I would try doing the insert in batches based on the primary key or another integer.  Create a while loop based on the key field and insert a smaller set of data within the while loop.  I've done this in the past for large inserts and it works well.  You can probably find an example on this site or another site.  Hope this helps!

    This.

    When record count of the SELECT = 0, exit the loop.  TEST FIRST in DEV. Make sure you're not inserting the same group over & over.

    INSERT INTO My_Table
    SELECT TOP (100000) FLD1, FLD2, FLD3
    FROM Other_Table OTH
    WHERE NOT EXISTS (SELECT * FROM My_Table MY WHERE MY.KEY = OTH.KEY)
    ORDER BY KEY fields  --- Can be useful when testing to make sure you get the next group.

     

     

  • That load query is fairly straightforward.  I think SQL should be able to write output as it is generated, reducing memory requirements.  Would have to see the query plan to be 100% sure.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Hello, thank you all for providing input.  I am going through all the suggestions and am attempting homebrew01's suggestion now.  A quick note to ScottPletcher: yes, it looks pretty straightforward.  I started that exact query this morning while I tended to other tasks, went back and looked (I actually forgot about it) 1.5 hours later, and it had only selected 25,000 rows of the roughly 590,000 needed.  The DEV machine's specs are as follows:

    4 cores

    96 GB of memory - 92 GB to SQL

    Data drive > 1 TB

    Log drive > 260 GB

     

  • Did you pre-allocate enough log space to handle the entire insert?

    I had a load task that took 4+ hours.  I noticed the initial log size was very small.  After I pre-allocated the log, it finished in 20 mins and change.

    Also, make sure SQL doesn't have to do a sort for the INSERT.  That could force SQL to produce, and store, the entire result set before starting the INSERTs.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Hello Everyone.  Here is the latest.  I decided to try inserting all of the records WITHOUT the incredibly large field [the varbinary(max) column... aka the blob of the actual PDF document].  It inserted the full 583,393 rows in less than 10 seconds.  So we all know the column listed above is the main reason for the time to load, the space taken in the database, and the memory needed to process!

    I am now trying to update that column in the loaded table with the varbinary(max) blob values from the tblDocuments table.  I am doing this in batches of 70,000; however, I believe I am going to kill the process, as it has been running for 1 hr. 30 min. and the log file is beginning to run out of disk space.  I am going to edit my script and try smaller increments.  Does anyone have a suggestion on updating one column in batches that would run more efficiently?  Update script...

    DECLARE @Rowcount INT = 1;

    WHILE (@Rowcount > 0)
    BEGIN
        PRINT @Rowcount;

        UPDATE TOP (70000) ud
        SET ud.FileData = d.FileData
        FROM dbo.UserDocuments ud
        INNER JOIN dbo.tblDocuments d ON ud.DocumentId = d.DocumentId;

        SET @Rowcount = @@ROWCOUNT;
        CHECKPOINT;
    END;

  • That's a lot of images at one time.  I suggest starting at 10K or even 5K and see how that goes first.

    Btw, be sure to use sp_tableoption before UPDATEing the images.  That will make the image load process vastly more efficient.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Hello Scott, can you elaborate on the sp_tableoption, please?
