How to ensure row order is kept in an import from CSV

  • I'm importing from csv text files & need to ensure row order is maintained. 

    I've already imported one file and then added an id field so order can be maintained from now on.  But how do I know the order was kept during the DTS import and while the id field was added?  I checked a few sample areas and it does seem to match the original file, but I can't check the entire 3 million rows manually.

    So what import methods can I use to be sure order is kept?

    Data: Easy to spill, hard to clean up!

  • If you are concerned with row order when retrieving data, you must use the ORDER BY clause in the SELECT statement.  The ANSI standard says that without the ORDER BY clause rows may be returned in any order, and that is what you are likely to get, regardless of the physical order in which the rows are stored.
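
    As a minimal sketch (assuming a hypothetical table MyImport with an id column added at load time; neither name is from this thread), the only reliable way to get the rows back in load order is:

    SELECT *
    FROM MyImport
    ORDER BY id   -- without this ORDER BY, no row order is guaranteed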

     

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • You could create a clustered index on the target table that uses the same columns that determine the order of the rows in the source csv file.  That would ensure that the data rows in the table are physically ordered in the same manner as they are in your source data.  This is (I think) the only way to be sure that the data stored in the table is ordered the way you want.  
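
    For example, a minimal sketch (the table and column names here are hypothetical, not from this thread):

    -- Assumes the rows in the source csv are ordered by LoadDate, then SeqNo
    CREATE CLUSTERED INDEX IX_Staging_LoadOrder
        ON dbo.Staging (LoadDate, SeqNo)
    GO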

  • Then I assume I can be sure rows are transferred to the destination SQL table in the same order as the source csv file.

    I'll go with that (unless someone out there knows better) - and import into a table with a clustered index.

    Data: Easy to spill, hard to clean up!

  • Yes, the clustered index will ensure that the rows are physically ordered based on the columns you specify in the index.  Some things to bear in mind: you can only have one clustered index per table (since it determines the row order, it would not make sense to have more than one), and by default SQL Server builds the primary key as the clustered index, although the two do not have to coincide.  You might want to take a look at the "Using Clustered Indexes" topic in Books Online; search for "clustered indexes, overview" in the BOL index.  Make sure you use only the minimum number of columns you need in the clustered index to ensure correct ordering and define your key.
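
    To make the primary-key point concrete, a short sketch (hypothetical table names, not from this thread):

    -- Option 1: the primary key doubles as the clustered index (the default)
    CREATE TABLE dbo.ImportA (
        id   int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
        col1 varchar(320) NULL
    )
    GO

    -- Option 2: a nonclustered primary key plus a separate clustered index
    CREATE TABLE dbo.ImportB (
        id   int IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
        col1 varchar(320) NULL
    )
    GO
    CREATE CLUSTERED INDEX IX_ImportB_id ON dbo.ImportB (id)
    GO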

  • I understand clustered indexes.  What I don't fully understand is DTS. 

    My question:  If DTS imports a csv file with no primary key, will the resulting table be in the same order as the source csv file?  From testing the imported file, it seems to me that it is in the same order.  However, is this behavior guaranteed or by chance?

    Data: Easy to spill, hard to clean up!

  • Yes, the order should be maintained on import from the CSV file through DTS, unless your table has a clustered index that would cause a re-sort.  However, since you say you have an IDENTITY column, it should not.

  • The behaviour is NOT guaranteed.  You need to understand what is meant by 'order' in a relational database, and how this interacts with a DTS load.

    The physical order in which rows exist has no relevance to the actual order in which rows are presented to the user by a SELECT statement.  There is one and only one way to obtain rows from a relational database in a specified sequence, and that is the ORDER BY clause.  If you do not use SELECT ... ORDER BY ... then any sequence of rows you obtain may not be repeatable.  There are many reasons for this, some of which are given below.

    Even if the table you are loading into exists in a single physical file and there is no clustered index, the physical order in which the data is stored is not guaranteed.  If the table exists in multiple physical files the physical order is even less guaranteed.  Even with a clustered index the physical order within the database is not guaranteed.

    When you are inserting rows into a table, SQL Server will allocate an extent of space for your table.  Assuming there is no clustered index and you have a large number of rows, this extent will be filled sequentially from start to end.  If more rows need to be inserted, a new extent will be obtained and rows inserted into that.  Extents are allocated to a table in the physical sequence in which they exist in the database.

    If the current DTS load is the first load into the table, then at the end of the load, the rows will almost certainly be in the same physical order in the table as they were in the input file.

    If you do any maintenance work on the database before your next DTS load you may have released an extent before the data for your table starts, or even in the middle of it.  The next DTS load will again use free extents in the physical sequence they exist in the database, but as you can see there is no guarantee these extents will be after all previous extents for your table.  In this situation, data for the current load will be physically interspersed between data previously loaded.

    It is important to stress there is nothing wrong with this behaviour.  On the contrary, this behaviour is desirable as it helps re-use space without extending the physical size of the database.

    Even if a clustered index is used, if the key sequence of new data is intermixed with existing data, an extent split will eventually occur.  The new extent will be the first available in the database, which could physically exist before extents containing data with lower key values.  A clustered index will guarantee that all rows in a given extent are physically stored in the order specified by the key, but it has no impact on the physical sequence in which extents are stored.  Although a 'SELECT *' on a table with a clustered index but NOT using an ORDER BY will normally present all rows in the sequence given by the key, the order is not actually guaranteed, and if multiple physical files or parallelism are involved then the order of presentation may not be repeatable.

    So to sum up, there is no point in being concerned about trying to maintain the physical sequence of rows across multiple DTS loads, because there is no way to guarantee the sequence.  Your concern should be directed towards obtaining the rows in a desired sequence in a SELECT statement, which will require an ORDER BY.  You can reduce the time required to obtain the data in the desired sequence by defining a clustered index that matches your ORDER BY.
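
    A short sketch of that last point (hypothetical names; assuming an id column that captures load order):

    -- The clustered index matches the ORDER BY, so little extra sorting is needed
    CREATE CLUSTERED INDEX IX_MyImport_id ON dbo.MyImport (id)
    GO

    SELECT *
    FROM dbo.MyImport
    ORDER BY id   -- still required: only ORDER BY guarantees the sequence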

     

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Thanks EdVassie for the detailed & thoughtful reply.  The problem, which I keep restating so it's understood, is that I have csv files that need to be imported to SQL Server.  The order of the rows is important; however, there is no combination of columns that I can use to order the rows.  Once in SQL Server, I can easily add an identity column as a primary key to keep the order.  However, will the act of importing and creating a key upset the current order of rows?  It seems the answer is: probably not.

    I've decided that I want to be more certain of the order.  Since no one came up with a DTS solution to ensure order is maintained, the best solution I can think of is to create a program (a batch file if I can, otherwise C++) that adds an additional id column to the csv file.  The additional column will simply number the rows, providing the key for ordering the table in SQL Server.

    Data: Easy to spill, hard to clean up!

  • If you define an IDENTITY column in your table you can load your data using DTS.  All that should be needed is for the DTS package to name all the columns it is inserting data into.  Any columns in the table that are not named must have a default value, or be an IDENTITY column.

    This should save you writing some custom code.
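
    A minimal sketch of the idea (the table and column names are hypothetical):

    CREATE TABLE dbo.MyImport (
        id   int IDENTITY(1,1) NOT NULL, -- not named by the load; populated automatically
        col1 varchar(320) NULL,
        col2 varchar(320) NULL
    )
    GO

    -- The DTS package (or any load) names only the data columns:
    INSERT INTO dbo.MyImport (col1, col2) VALUES ('a', 'b')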

     

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • I recently had an unrelated problem (NUL ASCII characters in the text file that confused DTS about the end of the row).  My solution was to read it in line by line in an ActiveX task and re-write each line back to a new file along with a row number.  This should work for your situation: the appended row number acts as an identity-style column that keeps your rows in the order of the original text file.

    '******** This step takes about 5 minutes for ~5.7 million rows
    Const ForReading = 1, ForWriting = 2, ForAppending = 8
    Const TristateUseDefault = -2, TristateTrue = -1, TristateFalse = 0

    Dim i                 ' row counter
    Dim s                 ' current line of text
    Dim fso, f, ts        ' output file objects
    Dim fso2, f2, ts2     ' input file objects

    Set fso = CreateObject("Scripting.FileSystemObject")
    fso.CreateTextFile "d:\Newfile.txt"    ' Create the output file.
    Set f = fso.GetFile("d:\Newfile.txt")
    Set ts = f.OpenAsTextStream(ForWriting, TristateUseDefault)

    Set fso2 = CreateObject("Scripting.FileSystemObject")
    Set f2 = fso2.GetFile("d:\OriginalFile.txt")
    Set ts2 = f2.OpenAsTextStream(ForReading, TristateUseDefault)

    i = 1
    Do While Not ts2.AtEndOfStream
        s = ts2.ReadLine                      ' read the next line from the original file
        ts.Write s & "*" & i & " " & vbCrLf   ' append the row number and write to the new file
        i = i + 1
    Loop

    ts.Close
    ts2.Close
    Set fso = Nothing
    Set fso2 = Nothing

  • I use the scripts below.

    While I haven't been rigorous about verifying the order of the inserts, my experience says the data does go in "in order" from the csv file.

    By using the "entryid" column, you can order on that.

    Also, I have a "get rid of duplicates" script.  It's slow as molasses but gets the job done.

    The second script uses dynamic SQL.  Why?  Because I can pass in which field holds the primary id as a parameter.  This is slower of course, but my imports aren't deadline critical.

    Just use 0 for @intWorkingimportjobid; I was setting it up so different people could use the import table at the same time.

    My script sets the duplicates to rowstatus = 'D'; I can either delete them or just ignore them when I run the import procedures.

    Maybe this will help, I dunno.

     

     

    if exists (select * from sysobjects
        where id = object_id('tblWorkingImportTable'))
        DROP TABLE tblWorkingImportTable
    GO

    CREATE TABLE tblWorkingImportTable (
        datacolumn1  varchar(320) NULL,
        datacolumn2  varchar(320) NULL,
        datacolumn3  varchar(320) NULL,
        datacolumn4  varchar(320) NULL,
        datacolumn5  varchar(320) NULL,
        datacolumn6  varchar(320) NULL,
        datacolumn7  varchar(320) NULL,
        datacolumn8  varchar(320) NULL,
        datacolumn9  varchar(320) NULL,
        datacolumn10 varchar(320) NULL,
        datacolumn11 varchar(320) NULL,
        datacolumn12 varchar(320) NULL,
        datacolumn13 varchar(320) NULL,
        datacolumn14 varchar(320) NULL,
        datacolumn15 varchar(320) NULL,
        datacolumn16 varchar(320) NULL,
        datacolumn17 varchar(320) NULL,
        datacolumn18 varchar(320) NULL,
        datacolumn19 varchar(320) NULL,
        datacolumn20 varchar(320) NULL,
        relationalidforcolumn1  int NULL,
        relationalidforcolumn2  int NULL,
        relationalidforcolumn3  int NULL,
        relationalidforcolumn4  int NULL,
        relationalidforcolumn5  int NULL,
        relationalidforcolumn6  int NULL,
        relationalidforcolumn7  int NULL,
        relationalidforcolumn8  int NULL,
        relationalidforcolumn9  int NULL,
        relationalidforcolumn10 int NULL,
        relationalidforcolumn11 int NULL,
        relationalidforcolumn12 int NULL,
        relationalidforcolumn13 int NULL,
        relationalidforcolumn14 int NULL,
        relationalidforcolumn15 int NULL,
        relationalidforcolumn16 int NULL,
        relationalidforcolumn17 int NULL,
        relationalidforcolumn18 int NULL,
        relationalidforcolumn19 int NULL,
        relationalidforcolumn20 int NULL,
        entryid int IDENTITY (1,1) PRIMARY KEY NONCLUSTERED,
        createdate datetime NOT NULL DEFAULT getDate(), -- used to show when the data was put in the table
        workingimportjobid int NULL DEFAULT (0),        -- used to distinguish this job from all others
        rowstatus char(1) NULL
    )
    GO
    GRANT REFERENCES, SELECT, INSERT, DELETE, UPDATE ON tblWorkingImportTable TO myUser
    GO

     

    And also

     

    if exists (select * from sysobjects where id = object_id('dbo.prc_eliminate_workingtable_duplicates') and sysstat & 0xf = 4)
        drop procedure dbo.prc_eliminate_workingtable_duplicates
    GO

    CREATE PROCEDURE prc_eliminate_workingtable_duplicates
    (
        @intWorkingimportjobid int,
        @strUniqueColumnName varchar(64),
        @int_return_value int = 0 output,
        @str_return_message varchar(2000) = '' output
    )
    AS

    declare @strSQLStatement varchar(8000)
    /*
    -- The hardcoded query this procedure generalises:
    Select distinct datacolumn2 from tblWorkingImportTable where datacolumn2 IN (
        Select datacolumn2 from tblWorkingImportTable
        GROUP BY datacolumn2
        HAVING Count(datacolumn2) >= 2
    )
    */
    DECLARE @uniqueid int
    DECLARE @msg varchar(2000)
    DECLARE @firstQueryFETCHSTATUS int
    DECLARE @intCounter int
    select @intCounter = 0 --// default value
    /*
    -- Original hardcoded cursor, kept for reference:
    DECLARE curData CURSOR FAST_FORWARD FOR Select distinct datacolumn2 from tblWorkingImportTable where datacolumn2 IN (
        Select datacolumn2 from tblWorkingImportTable
        GROUP BY datacolumn2
        HAVING Count(datacolumn2) >= 2
    )
    */
    set nocount on

    -- Declare the outer cursor dynamically so the duplicate check runs against
    -- whichever column was passed in as @strUniqueColumnName.
    select @strSQLStatement = 'DECLARE curData CURSOR FAST_FORWARD FOR Select Distinct ' + @strUniqueColumnName + ' from tblWorkingImportTable where ' + @strUniqueColumnName + ' IN (Select ' + @strUniqueColumnName + ' from tblWorkingImportTable GROUP BY ' + @strUniqueColumnName + ' HAVING Count( ' + @strUniqueColumnName + ' ) >= 2)'
    EXEC (@strSQLStatement)

    OPEN curData

    -- Perform the first fetch.
    fetch curData into @uniqueid
    select @firstQueryFETCHSTATUS = @@FETCH_STATUS

    IF @firstQueryFETCHSTATUS <> 0
    begin
        select @msg = '<<No Duplicates In Import File/Data.>>'
        print @msg
    end

    WHILE @firstQueryFETCHSTATUS = 0
    BEGIN
        SELECT @msg = ' Duplicate ID (' + @strUniqueColumnName + ') = ' + convert(varchar(10), @uniqueid)
        --PRINT @msg

        select @intCounter = 0
        DECLARE @entryid int

        -- The inner cursor is also built dynamically (the original hardcoded
        -- datacolumn2 here, which defeated the parameter), and ORDER BY entryid
        -- ensures the earliest-loaded row is the one that is kept.
        select @strSQLStatement = 'DECLARE cursorInside CURSOR FAST_FORWARD FOR Select entryid from tblWorkingImportTable where ' + @strUniqueColumnName + ' = ' + convert(varchar(10), @uniqueid) + ' ORDER BY entryid'
        EXEC (@strSQLStatement)

        OPEN cursorInside

        -- Perform the first fetch.
        fetch cursorInside into @entryid

        IF @@FETCH_STATUS <> 0
        begin
            select @msg = '<<No data.>>'
            print @msg
        end

        WHILE @@FETCH_STATUS = 0
        BEGIN
            select @intCounter = @intCounter + 1

            SELECT @msg = ' One of the duplicates has an entryid = ' + convert(varchar(10), @entryid)
            --PRINT @msg
            SELECT @msg = ' @intCounter = ' + convert(varchar(10), @intCounter)
            --PRINT @msg

            if @intCounter > 1
            BEGIN
                -- This is a duplicate: every fetch after the first gets flagged
                UPDATE tblWorkingImportTable
                SET rowstatus = 'D'
                WHERE entryid = @entryid
            END

            FETCH NEXT FROM cursorInside INTO @entryid
        END

        CLOSE cursorInside
        DEALLOCATE cursorInside

        --Select rowstatus from tblWorkingImportTable where datacolumn2 = @uniqueid
        FETCH NEXT FROM curData INTO @uniqueid
        select @firstQueryFETCHSTATUS = @@FETCH_STATUS
    END

    CLOSE curData
    DEALLOCATE curData
    set nocount off

    select @int_return_value = 0
    select @str_return_message = 'Success'
    GO
    --GRANT EXECUTE ON prc_eliminate_workingtable_duplicates TO myUser
    GO
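
    For reference, a hypothetical invocation (the local variable names are mine, not from the post above):

    DECLARE @rc int
    DECLARE @msgout varchar(2000)
    EXEC prc_eliminate_workingtable_duplicates
        @intWorkingimportjobid = 0,             -- just use 0, per the note above
        @strUniqueColumnName   = 'datacolumn2', -- the field that holds the primary id
        @int_return_value      = @rc output,
        @str_return_message    = @msgout output
    -- Rows flagged rowstatus = 'D' can now be deleted or simply ignored.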

  • I'd like a combination of Bill Nye's solution and my input.

    You can use his solution to verify my claims.  Or you can tack an "ordinal_insert" column onto my table, and use Bill's ordinal value to populate that field.
