EXCEL HELL!!!

  • When people tell me they have spreadsheets they want importing into sql, I usually ask them if they can supply it as csv because it saves any problems with differences between Excel 2003 and 2010 (we have both in this business) and it helps by delimiting the text fields where commas and quotes and other fancy characters often get used.

    That said we do often import spreadsheets in SSIS, but you have to remember that the excel connector decides the field types based on the first ten rows of the spreadsheet and that can be completely the wrong thing.

  • P Jones (3/5/2015)


    That said we do often import spreadsheets in SSIS, but you have to remember that the excel connector decides the field types based on the first ten rows of the spreadsheet and that can be completely the wrong thing.

    There is a registry setting that you can change to make it scan more rows.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (3/5/2015)


    P Jones (3/5/2015)


    That said we do often import spreadsheets in SSIS, but you have to remember that the excel connector decides the field types based on the first ten rows of the spreadsheet and that can be completely the wrong thing.

    There is a registry setting that you can change to make it scan more rows.

    IIRC, the problem with that is that if you have multiple header rows (for example), it will leave some of them as blank.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • P Jones (3/5/2015)


    When people tell me they have spreadsheets they want importing into sql, I usually ask them if they can supply it as csv because it saves any problems with differences between Excel 2003 and 2010 (we have both in this business) and it helps by delimiting the text fields where commas and quotes and other fancy characters often get used.

    That said we do often import spreadsheets in SSIS, but you have to remember that the excel connector decides the field types based on the first ten rows of the spreadsheet and that can be completely the wrong thing.

    Thanks for the feedback on this.

    Not that it's necessarily a bad idea (especially since the DA group at work has some of our customers to the same thing) but I have (a long time ago, in fact) stopped asking for users to export the data to CSV or TSV because the method of just doing a file save as one of those two types doesn't buy me a whole lot insofar as reducing any complexity. It also adds the problem of Excel putting quotes around only those "cells" that have commas in them, which BCP (one of the tools I prefer) doesn't handle so well. Exporting as CSV/TSV also produces a whole bunch of blank lines and extra tab columns unless the spreadsheet is slammed to the top left corner of the spreadsheet. The ACE drivers are a bit more forgiving and logical in both of those areas and saves quite a bit of prep work for me especially since most users have a really difficult time understanding written instructions. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • "The ACE drivers"

    So Access could be used instead to solve the problem outlined in the opening post?

  • I'm not so sure that I'd use Access to pump a spreadsheet into SQL Server when SQL Server can do it just fine but, sure, it could also be done in Access.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • How about not pumping anything into SQL Server and do all the processing and output in Access?

  • grovelli-262555 (3/6/2015)


    How about not pumping anything into SQL Server and do all the processing and output in Access?

    That's what I said in my previous post. 😉

    But, my goal isn't to get this stuff into Access. My goal is to do the import to SQL Server where I need it (and I've got that).

    Have fun with Access. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thank you all for this "debate"

    I would nether recommend anybody to use SSIS or an ACE driver in order to import sheets in native Excel file format. This may work on your on PC but if you work as a Consultant you will not get payed for the problems you will meet.

    If the task is just to import a table of data in a sheet save the content as a TAB separated textfile. A simple SaveAS macro can do that for you. Then use BULK INSERT, T SQL. If you want to import parts of data in many sheets do it with VBA in Excel.

    Gosta M

  • Gosta Munktell (3/7/2015)


    Thank you all for this "debate"

    I would nether recommend anybody to use SSIS or an ACE driver in order to import sheets in native Excel file format. This may work on your on PC but if you work as a Consultant you will not get payed for the problems you will meet.

    If the task is just to import a table of data in a sheet save the content as a TAB separated textfile. A simple SaveAS macro can do that for you. Then use BULK INSERT, T SQL. If you want to import parts of data in many sheets do it with VBA in Excel.

    Gosta M

    Gosta Munktell (3/7/2015)


    Thank you all for this "debate"

    I would nether recommend anybody to use SSIS or an ACE driver in order to import sheets in native Excel file format. This may work on your on PC but if you work as a Consultant you will not get payed for the problems you will meet.

    If the task is just to import a table of data in a sheet save the content as a TAB separated textfile. A simple SaveAS macro can do that for you. Then use BULK INSERT, T SQL. If you want to import parts of data in many sheets do it with VBA in Excel.

    Gosta M

    Thanks for the feedback but, gosh... I have to disagree almost all the way around on all that.

    First, VBA adds to the problem of the proverbial "Tower of Babel". It's yet another language and another skill set required to do something simple instead of just using T-SQL, which is always available if you have SQL Server. Further, it requires distribution, maintenance, and redistribution of "controlled" spreadsheets and will cause more work than even writing extremely complex SQL (and it's not complex at all).

    As to saving the content as a TAB separated textfile, let's see how well that works. I've attached a slightly simpler spreadsheet and the resulting TSV file that it produced. Let's see how easy it is for you to use BULK INSERT to actually be able to do something with the data in the TSV. Notice the double quotes that SQL Server places around the cell data if it has a comma in it even if it was exported as TSV. If you do the same thing with the ACE drivers, those problems are automatically resolved because that's what it was built to do.

    And, keep in mind that we're not trying to duplicate the structure of the spreadsheet in a table. What we really want is one row per data element from the spreadsheet.

    Last but not least, you simply cannot trust users to correctly save a TSV or CSV file. You'll find that the attached TSV file has some anomalies in it because the user just did a save-as instead of highlighting what was to be saved.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Good stuff here on the suggestions. Here is an odd one from me that I didn't try, but I think it would work similar to Eirikurs.

    Create a data flow using Excel Source with a single file connection to the Excel document. First pass will rip out the Class Scores as Label. The second pass will rip out the header years, third pass will rip out the column headers and the fourth pass for the values. Use of conditional splits can be used if Excel Source openrow is set to read the entire row to catch 2014, 2015, onward along with all the months. Formulation of a structure can be as exampled:

    Label, Year, Month, Name, ID, Status, Phy, Mat, Chem

    This only leaves the Totals missing, which can be snagged too. But, you can also add a SQL Task to calculate the totals as the final step too. 😀

    Don't know if that works. It sounds good in my head. I feel it would be simple to achieve quickly in SSIS. But then again, I play drums well in my head too and I'm a bad drummer. :hehe:

  • Jeff Moden (3/7/2015)


    Gosta Munktell (3/7/2015)


    Thank you all for this "debate"

    I would nether recommend anybody to use SSIS or an ACE driver in order to import sheets in native Excel file format. This may work on your on PC but if you work as a Consultant you will not get payed for the problems you will meet.

    If the task is just to import a table of data in a sheet save the content as a TAB separated textfile. A simple SaveAS macro can do that for you. Then use BULK INSERT, T SQL. If you want to import parts of data in many sheets do it with VBA in Excel.

    Gosta M

    Gosta Munktell (3/7/2015)


    Thank you all for this "debate"

    I would nether recommend anybody to use SSIS or an ACE driver in order to import sheets in native Excel file format. This may work on your on PC but if you work as a Consultant you will not get payed for the problems you will meet.

    If the task is just to import a table of data in a sheet save the content as a TAB separated textfile. A simple SaveAS macro can do that for you. Then use BULK INSERT, T SQL. If you want to import parts of data in many sheets do it with VBA in Excel.

    Gosta M

    Thanks for the feedback but, gosh... I have to disagree almost all the way around on all that.

    First, VBA adds to the problem of the proverbial "Tower of Babel". It's yet another language and another skill set required to do something simple instead of just using T-SQL, which is always available if you have SQL Server. Further, it requires distribution, maintenance, and redistribution of "controlled" spreadsheets and will cause more work than even writing extremely complex SQL (and it's not complex at all).

    As to saving the content as a TAB separated textfile, let's see how well that works. I've attached a slightly simpler spreadsheet and the resulting TSV file that it produced. Let's see how easy it is for you to use BULK INSERT to actually be able to do something with the data in the TSV. Notice the double quotes that SQL Server places around the cell data if it has a comma in it even if it was exported as TSV. If you do the same thing with the ACE drivers, those problems are automatically resolved because that's what it was built to do.

    And, keep in mind that we're not trying to duplicate the structure of the spreadsheet in a table. What we really want is one row per data element from the spreadsheet.

    Last but not least, you simply cannot trust users to correctly save a TSV or CSV file. You'll find that the attached TSV file has some anomalies in it because the user just did a save-as instead of highlighting what was to be saved.

    Totally agree with Jeff, here. I spent months untangling a file loading process that was written first by a vb guy, taken over by a Java guy, and made a Python guy quit. It's all in SQL now, aside from a Windows scheduled task to grab files from a local SFTP server, because we just could NOT get the permissions to work via SP/Agent Job.

    Moral of the story? You can't ask project managers to edit massive JSON files to load things into SQL.

    :hehe:

  • Ain't that the truth. When I took my first SQL Server job at the company I am at today, the previous guy wrote everything in Access and VBA code. I replaced it all with the help of another great SQL guy all in TSQL code. People were amazed how easy it was to replace the same objectives VBA was being used to complete with the use of all TSQL.

    In theory, we were only replacing VBA with SQL right. It only comes down to a benefit of VBA versus SQL. But the thing, VBA help was harder to find and harder to manage than TSQL, which is easier to manage and easier to find help for. Then there was the issue it was all tied to a MS Access program that is on the client end. It was restricted to a clients machine as opposed to being on the backend where it should belong as a service. TSQL allowed us to move it to SQL Server, which is on the backend and could run anywhere SQL Server lived without impacting the client. It worked out great.

    BUTTTTT, don't you pick on Python. I'm sadden you made a python developer quit. Python is my language and it's FAAAARR more awesome than TSQL 😛

  • xsevensinzx (3/7/2015)


    Good stuff here on the suggestions. Here is an odd one from me that I didn't try, but I think it would work similar to Eirikurs.

    Create a data flow using Excel Source with a single file connection to the Excel document. First pass will rip out the Class Scores as Label. The second pass will rip out the header years, third pass will rip out the column headers and the fourth pass for the values. Use of conditional splits can be used if Excel Source openrow is set to read the entire row to catch 2014, 2015, onward along with all the months. Formulation of a structure can be as exampled:

    Label, Year, Month, Name, ID, Status, Phy, Mat, Chem

    This only leaves the Totals missing, which can be snagged too. But, you can also add a SQL Task to calculate the totals as the final step too. 😀

    Don't know if that works. It sounds good in my head. I feel it would be simple to achieve quickly in SSIS. But then again, I play drums well in my head too and I'm a bad drummer. :hehe:

    Now that's closer to what I'm talking about. Something where you don't need to change anything as monthly "partitions" of columns are added to the spreadsheet.

    The only thing that I'd make a change on is to make the classes the same way so that you have some consistency on the SQL Server side of the house as classes are added. I'd also combine the Year and Month For example...

    Name, ID, Status, Date, ClassName, Score

    Yeah... that's horribly denormalized but that would be the first step to getting the data into the server. This would be a staging table that you could produce a couple of normalized tables from but considering that you load the whole spreadsheet every time, you have to start somewhere.

    I've just finished building my presentation on the subject of how it can all be done in T-SQL. The code is remarkably simple for what it does and how it accounts for they dynamics of spreadsheets that grow and change over time. There are 4 small sections to the code.

    1. Import the raw data from the spreadsheet (name can be changed as a parameter).

    2. Get the meta-data from the temp table so that we can equate the F* column names to the column names embedded in the spreadsheet.

    3. Get the monthly "partition" headers, combine them, and smear them into the meta-data temp table so that each row (each row represents an F* column)

    4. Using the meta-data table and the original raw data, unpivot the data into an EAV table where each row represents a single cell value from the spreadsheet.

    The code is in a "Proof-of-Principle" status (IMHO) but will currently take ANY spreadsheet that follows the general form of 2 "partitioning" header rows (no matter what they are and no matter if there are horizontal merges or not), a main header row, and then data. The next step will be to make it a little bit more intelligent so that maybe it can auto-magically figure out how many "partitioning" header rows there are, a parameter to accept sheet names (maybe even do a little auto-discovery of that but not sure of the driver for such purposes).

    One of the neat parts is, it requires no interaction and, so, no training of the user that is the owner of the spreadsheet and no distribution of boilerplate spreadsheet templates. The other neat part is that, with the exception of making a call to the ACE driver/provider, it's all T-SQL. And, it handles the "embedded comma" problem without having to anticipate it. It just does it.

    I'm giving the presentation at the Local PASS chapter this Thursday. If it pans out well, I'll write an article on it.

    Thanks for the tips on the SSIS side of the house. I figured it wasn't as easy in SSIS as some people outside this thread were saying but had to be sure. You folks are the best!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Just chipping in a little bit, one of the bigger annoyances when importing data from Excel spreadsheets is the lack of metadata functions/view, think thousands of files with a time stamp in the worksheet name:pinch: Dealt with this in the passed by reversing the BIFF5 format, came across similar situation recently so I re-wrote from memory (as I couldn't find the original) the initial procedure and added support for the XSLX. The code is below, enjoy

    😎

    USE Test;

    GO

    GO

    CREATE PROCEDURE dbo.EE_LIST_EXCEL_WORKSHEETS

    (

    @FILE_PATH NVARCHAR(2048) = NULL

    )

    AS

    /*

    Procedure dbo.EE_LIST_EXCEL_WORKSHEETS

    Initial coding ca. 1996, Eirikur Eiriksson

    Re-written 2015-03-03

    Description: List visible worksheets in an Excel workbook (BIFF8/XLSX)

    Dependencies: The XLSX file format is a zipped multi document XML, the procedure

    needs an decompression CLR function, the one used here is a part of

    the SQL# from http://www.sqlsharp.com/

    USAGE

    DECLARE @FILE_PATH NVARCHAR(2048) = N'C:\INCOMING_DATA\DROP_FOLDER\Sample Spreadsheet for Import.xlsx';

    EXEC dbo.EE_LIST_EXCEL_WORKSHEETS @FILE_PATH;

    */

    SET NOCOUNT ON;

    /*

    Very simple input validation

    */

    IF @FILE_PATH IS NULL OR LEN(@FILE_PATH) < 6 RETURN;

    /*

    A table variable to store the binary file data

    */

    DECLARE @BINBLOB TABLE

    (

    BINDATA VARBINARY(MAX) NOT NULL

    ,IS_XML_ZIP AS (SIGN(CHARINDEX(0x504B030414,BINDATA,1)))

    ,IS_XSL AS (SIGN(CHARINDEX(0xD0CF11E0A1B11AE1,BINDATA,1)))

    );

    DECLARE @SQL_STR NVARCHAR(MAX) = N'SELECT x.BulkColumn FROM OPENROWSET(

    BULK ' + NCHAR(39) + @FILE_PATH + NCHAR(39) + N',

    SINGLE_BLOB) AS x;'

    INSERT INTO @BINBLOB (BINDATA)

    EXEC(@SQL_STR);

    IF (SELECT TOP(1) IS_XML_ZIP FROM @BINBLOB) = 1

    BEGIN

    /*

    In XSLX or compressed XML files the worksheet names are stored in a separate XML

    document called xl/workbook.xml

    */

    DECLARE @START_PAT VARCHAR(26) = 'xl/workbook.xml';

    DECLARE @END_PAT VARCHAR(26) = 'xl/';

    ;WITH XMLNAMESPACES (

    'http://schemas.openxmlformats.org/officeDocument/2006/relationships' AS R

    ,DEFAULT 'http://schemas.openxmlformats.org/spreadsheetml/2006/main')

    ,WORKBOOK_XML_START AS

    (

    SELECT

    CHARINDEX(@START_PAT, BB.BINDATA,1) AS X_POS

    ,BB.BINDATA

    FROM @BINBLOB BB

    WHERE BB.IS_XML_ZIP = 1

    )

    ,WORKBOOK_INFO AS

    (

    SELECT

    CONVERT(XML,[SQL#].[Util_Inflate](SUBSTRING(WXS.BINDATA,WXS.X_POS

    + LEN(@START_PAT)

    ,CHARINDEX(@END_PAT,WXS.BINDATA,WXS.X_POS

    + LEN(@START_PAT)) - (WXS.X_POS + LEN(@START_PAT))))

    ,0) AS WXML

    FROM WORKBOOK_XML_START WXS

    )

    SELECT

    WXML.DATA.value('@name','NVARCHAR(255)') AS WS_NAME

    FROM WORKBOOK_INFO WI

    CROSS APPLY WI.WXML.nodes('workbook/sheets/sheet') AS WXML(DATA)

    WHERE WXML.DATA.value('@state','NVARCHAR(255)') = N'visible';

    END

    ELSE IF (SELECT TOP(1) IS_XSL FROM @BINBLOB) = 1

    BEGIN

    /*

    BIFF7/8 (Excel 95-2003) have a BOUNDSHEET:Sheet Information structure that

    stores the worksheet names, the @BINPAT_START variable holds the record identifier

    */

    DECLARE @BINPAT_START BINARY(2) = 0x8500;

    ;WITH SHEET_NAME_ARRAY AS

    (

    SELECT

    CHARINDEX(@BINPAT_START,BB.BINDATA,1) AS X_POS

    ,BB.BINDATA

    FROM @BINBLOB BB

    WHERE BB.IS_XSL = 1

    UNION ALL

    SELECT

    CHARINDEX(@BINPAT_START,SNA.BINDATA,SNA.X_POS + 11)

    ,SNA.BINDATA

    FROM SHEET_NAME_ARRAY SNA

    WHERE SNA.X_POS > 0

    )

    ,BOUNDSHEET_ARRAY AS

    (

    SELECT

    SNA.X_POS

    ,SNA.BINDATA

    ,CONVERT(INT,SUBSTRING(SNA.BINDATA,SNA.X_POS + 8,2),0) AS IS_HIDDEN

    ,CONVERT(INT,SUBSTRING(SNA.BINDATA,SNA.X_POS + 10,1),0) AS NAME_LEN

    FROM SHEET_NAME_ARRAY SNA

    WHERE SNA.X_POS > 0

    )

    SELECT

    CONVERT(VARCHAR(256),SUBSTRING(BA.BINDATA,BA.X_POS + 12,BA.NAME_LEN),0) AS WS_NAME

    FROM BOUNDSHEET_ARRAY BA

    WHERE BA.IS_HIDDEN = 0;

    END

Viewing 15 posts - 16 through 30 (of 63 total)

You must be logged in to reply to this topic. Login to reply