Importing very large flat file throws truncation errors

  • I'm having hue difficulties importing US.txt, see this file: http://download.geonames.org/export/dump/US.zip (overview of all files and readme at bottom of page: http://download.geonames.org/export/dump/)

    Im getting all sort of truncating errors, for example I got a truncation error on record nr 1281, a record where nothing weird seems to be going on.

    Have been playing around with some settings but nothing works.

    I know that since its a flat file I can configure the column length on the advanced tag when choosing a datasource.

    My questions short and sweet:

    1. what should be my table definition? This is what I have now:

    CREATE TABLE [dbo].[cities_geonames](

    [geonameid] [int] NULL,

    [name] [nvarchar](max) NULL,

    [asciiname] [nvarchar](max) NULL,

    [alternatenames] [nvarchar](255) NULL,

    [latitude] [float] NULL,

    [longitude] [float] NULL,

    [feature class] [nvarchar](255) NULL,

    [feature code] [nvarchar](255) NULL,

    [country code] [nvarchar](255) NULL,

    [cc2] [nvarchar](255) NULL,

    [admin1 code] [nvarchar](255) NULL,

    [admin2 code] [nvarchar](255) NULL,

    [admin3 code] [nvarchar](255) NULL,

    [admin4 code] [nvarchar](255) NULL,

    [population] [nvarchar](255) NULL,

    [elevation] [nvarchar](255) NULL,

    [gtopo30] [float] NULL,

    [timezone] [nvarchar](255) NULL,

    [modification date] [datetime] NULL

    ) ON [PRIMARY]

    2. what should be my configuration on the advanced tab with regard to column length and type, if any

  • pretty sure it has to do with the import package defaulting to a 50 char field, and the data for example, on line 1284 is 54 characters in length.

    "Beautiful Zion African Methodist Episcopal Zion Church" happens to be the first item that is over 50 chars. (54)

    assuming you are using Import Data... for building your package to import it quickly, you'll want to edit the definitions to be a little bigger:

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Hi, thx for your reply. I set the new width to 150, but after import I get this:

    Error 0xc02020a1: Data Flow Task 1: Data conversion failed. The data conversion for column "Column 1" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".

    (SQL Server Import and Export Wizard)

    Error 0xc020902a: Data Flow Task 1: The "output column "Column 1" (14)" failed because truncation occurred, and the truncation row disposition on "output column "Column 1" (14)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component.

    (SQL Server Import and Export Wizard)

    Error 0xc0202092: Data Flow Task 1: An error occurred while processing file "D:\US.txt" on data row 1281.

    (SQL Server Import and Export Wizard)

    I notice you changed the setting on column 2, but I receive the error on column 1

    Im going crazy here!

  • what a pain that file is.

    this is what i did:

    i converted all teh damn unix LF to vbCrLf.

    then i used this script...i had to tweak it a few times, becasue it would fail on the ALTERNATENAMES column, as some of them have alternate names of more than 2000 chars!

    i gave up, made it a nvarcharmax, and not this script works.

    if you need the file in PC format, it's 55 meg and i put a copy here:

    US_PC_Formatted.zip

    this was my code, using BULK INSERT to get all 2,082,844 rows

    --drop table [dbo].[CITIES_GEONAMES]

    CREATE TABLE [dbo].[CITIES_GEONAMES] (

    [GEONAMEID] INT NULL,

    [NAME] NVARCHAR(max) NULL,

    [ASCIINAME] NVARCHAR(max) NULL,

    [ALTERNATENAMES] NVARCHAR(max) NULL,

    [LATITUDE] FLOAT NULL,

    [LONGITUDE] FLOAT NULL,

    [FEATURE CLASS] NVARCHAR(510) NULL,

    [FEATURE CODE] NVARCHAR(510) NULL,

    [COUNTRY CODE] NVARCHAR(510) NULL,

    [CC2] NVARCHAR(510) NULL,

    [ADMIN1 CODE] NVARCHAR(510) NULL,

    [ADMIN2 CODE] NVARCHAR(510) NULL,

    [ADMIN3 CODE] NVARCHAR(510) NULL,

    [ADMIN4 CODE] NVARCHAR(510) NULL,

    [POPULATION] NVARCHAR(510) NULL,

    [ELEVATION] NVARCHAR(510) NULL,

    [GTOPO30] FLOAT NULL,

    [TIMEZONE] NVARCHAR(510) NULL,

    [MODIFICATION DATE] DATETIME NULL)

    BULK INSERT [dbo].[CITIES_GEONAMES] FROM 'c:\Data\US.txt'

    WITH (

    DATAFILETYPE = 'char',

    FIELDTERMINATOR = '\t',

    ROWTERMINATOR = '\n',

    FIRSTROW = 1

    )

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Gosh, folks. I've not tried to import the data (yet) but the "Read Me" pointed out in the original posts seems to identify column requirements quite nicely. I've highlighted the requirements for the troublesome column. Is this just a "PICNIC" problem? 🙂

    [font="Courier New"]The main 'geoname' table has the following fields :

    ---------------------------------------------------

    geonameid : integer id of record in geonames database

    name : name of geographical point (utf8) varchar(200)

    asciiname : name of geographical point in plain ascii characters, varchar(200)

    [highlight]alternatenames : alternatenames, comma separated varchar(5000)[/highlight]

    latitude : latitude in decimal degrees (wgs84)

    longitude : longitude in decimal degrees (wgs84)

    feature class : see http://www.geonames.org/export/codes.html, char(1)

    feature code : see http://www.geonames.org/export/codes.html, varchar(10)

    country code : ISO-3166 2-letter country code, 2 characters

    cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 60 characters

    admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)

    admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)

    admin3 code : code for third level administrative division, varchar(20)

    admin4 code : code for fourth level administrative division, varchar(20)

    population : bigint (8 byte int)

    elevation : in meters, integer

    gtopo30 : average elevation of 30'x30' (ca 900mx900m) area in meters, integer

    timezone : the timezone id (see file timeZone.txt)

    modification date : date of last modification in yyyy-MM-dd format[/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (5/24/2011)


    Gosh, folks. I've not tried to import the data (yet) but the "Read Me" pointed out in the original posts seems to identify column requirements quite nicely. I've highlighted the requirements for the troublesome column. Is this just a "PICNIC" problem? 🙂

    I admin I have that Y chromosome disorder about asking directions, I guess it leaked over onto the RTFM rules too!

    Thanks Jeff!

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (5/25/2011)


    Jeff Moden (5/24/2011)


    Gosh, folks. I've not tried to import the data (yet) but the "Read Me" pointed out in the original posts seems to identify column requirements quite nicely. I've highlighted the requirements for the troublesome column. Is this just a "PICNIC" problem? 🙂

    I admin I have that Y chromosome disorder about asking directions, I guess it leaked over onto the RTFM rules too!

    Thanks Jeff!

    BWAA-HAAA!!! I guess that's why it's the "Y" chromosome... it stands for "Why ask Why". 😀 We all suffer from it. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • @jeff: thanks for pointing out my own laziness 😉

    @lowell: You code almost does what I need:

    BULK INSERT [dbo].[CITIES_GEONAMES] FROM 'D:\US.txt'

    WITH (

    DATAFILETYPE = 'char',

    FIELDTERMINATOR = '\t',

    ROWTERMINATOR = '',

    FIRSTROW = 1

    )

    But my table has 1 extra column:

    provinceid int Checked (default value 0)

    On that extra column an error is thrown:

    Msg 4864, Level 16, State 1, Line 3

    Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 20 (provinceid).

    How can I bulk insert the text file, where the default value for provinceid is used?

    Thanks!

  • the columns in the table and the columns in the BULK insert must match exactly, with no extra or missing columns....unless you use a format file.

    you can use OPENROWSET to bulk insert, and select specific column names, but that requires that same format file.

    INSERT INTO US(ColumnList)

    SELECT ColumnList

    FROM OPENROWSET(BULK N'c:\Data\US.txt', FORMATFILE='c:\Data\US.fmt') as myAliasBulk

    since my code is tested and working, i'd just insert into a temporary table with my table definition, then insert from that SQL table into your permanant table.

    easier and faster, i'd say.

    or bulk insert, then add the column after you are done.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (5/25/2011)


    the columns in the table and the columns in the BULK insert must match exactly, with no extra or missing columns....unless you use a format file.

    As a possible alternative to format files, create a view and Bulk Insert into the view. 🙂 It doesn't solve the problem of not importing all the columns but will get you around such things as the target table having an IDENTITY column, etc, without having to build a format file.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • As a side bar, I absolutely agree with Lowell's recommendation. I never import directly into the "final" table. I always use a staging table (Temp Table or otherwise) so that I can validate the data before it goes anywhere near production.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Guys, thank you both!!

    @lowell: I did what you suggested, imported it into a temp table first 🙂

  • After way too many hours of 'practice' with the 7.7million rows in GeoName's allcountries.txt I can pass on a couple of tips:

    Lowell (5/24/2011)


    what a pain that file is.

    this is what i did:

    i converted all teh damn unix LF to vbCrLf.

    then i used this script...i had to tweak it a few times, becasue it would fail on the ALTERNATENAMES column, as some of them have alternate names of more than 2000 chars!

    i gave up, made it a nvarcharmax, and not this script works.

    To get around the Unix LF problem, this works:

    DECLARE @bulk_cmd varchar(1000)SET @bulk_cmd = 'BULK INSERT [dbo].[YourTableName]FROM ''c:\allCountries.txt'' WITH (ROWTERMINATOR = ''' + CHAR(10) + ''')'

    EXEC(@bulk_cmd)

    The above inserts them into the table Lowell outlined earlier. However I wanted my GeoName table to be a closer match on datatype & size, using varchar(min_size_necessary) etc.

    Despite what's stated in the GeoName readme:

    alternatenames : alternatenames, comma separated varchar(5000)

    current max length for the 'alternatenames' column is 6171 characters...the rest however all fit within the Geonames stated field size. ( My tweaked table has both extra and skipped fields so I won't script it here )

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply