How to store lots of data (1024 columns) for a particular reading

  • Just to clarify... you said the range would be from 0 to 50000 counts (I think I know what that means but you might want to clarify). So would you actually need scientific notation with exponent values as high as e56 such as in the example you gave above or can we get away with integers?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (11/21/2011)


    Just to clarify... you said the range would be from 0 to 50000 counts (I think I know what that means but you might want to clarify). So would you actually need scientific notation with exponent values as high as e56 such as in the example you gave above?

No, I would not assume so. That was just an example of the format we would have stored the strings in so that they could easily be parsed back into data by a front end if that solution were used. The numbers I used were just placeholders. It could just as easily have been expressed as x.xxxexx. I believe a reasonable solution can be worked out assuming the highest count will be 50,000. If it's bigger, I will just have to change the size of the binary data type. Our boss always wants things in the most flexible form, which is why that format was chosen.

Using your single-row array example of Array { 10,200,3000,40000 }, the data as "concatenated binary" in SQL Server would look like the following (assuming 4-byte integer values, which covers up to ~2 billion)...

RESULT:

0x0000000A000000C800000BB800009C40
  |------||------||------||------|
        10     200    3000   40000

For a bit of visual clarification of the same value, let's remove the "0x" prefix (which does NOT exist as part of the actual data) and add some spacing so you can see the fixed-field format of the binary-encoded data...

Binary "prefix" removed and spaces added for clarity:

0000000A 000000C8 00000BB8 00009C40
-------- -------- -------- --------
 4 bytes  4 bytes  4 bytes  4 bytes

    To make a bit of "test" data consisting of 10 binary encoded elements in each row for 10 rows, we could use the following code (all the elements are random numbers 0 <= N < 1000000 in this case)...

SELECT TOP 10
       CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
     + CAST(ABS(CHECKSUM(NEWID()))%1000000 AS BINARY(4))
  FROM sys.all_columns

    The result of that bit of random computational heaven is...

0x00081997000C3595000A2291000338440004C4D40001A57B0003C469000C4AE30004F72400002F4A
0x0004FC3000062F920008C7AC000C831A00094E350003BBC7000C829B00068CB9000473460002B230
0x000910BD0005DDAC000247E2000E841E000F2D85000DD92A00006EBD000A3ED8000C411A00045D20
0x000C35560008396F000CB99F0008F4F8000C167700033D2C00082D270000F1530008A3ED000CF026
0x0007DE31000CD9F7000B6F5A000D49F9000B6C44000864000004DB71000BFB9700028E690006A6FB
0x0005727D0009975C000CE8FA0000A8DC0005FC6200075A6F000EB61A0003D949000BD0E40005DE97
0x00054E6D0000A4E700036807000C5A1A0008DB3B000338550000330B000CAE9500039C120000BE94
0x000922D800044A840005EA57000730380005D41A00014215000CAC2D00069A3700054674000B6E46
0x000A65CD0006AFE7000E3BE50005A09C00074A8D0003959B0003A33100062ED2000B1639000CD914
0x000091FA0005D75100081F480004DD89000AB68B000C84750003CE77000819400009674E0009BBED

... and I believe you'll be able to see the fixed fields (4 bytes, or 8 nibbles, each).
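For illustration only, here's a minimal sketch (my addition, not from the original post) of cracking one of those values back apart in T-SQL: SUBSTRING works directly on binary, and the inline VALUES list stands in for a proper tally table...

DECLARE @b VARBINARY(8000) = 0x0000000A000000C800000BB800009C40;

SELECT ElementNumber = t.N,
       ElementValue  = CAST(SUBSTRING(@b,(t.N-1)*4+1,4) AS INT)
  FROM (VALUES(1),(2),(3),(4)) AS t(N); --== Stand-in for a tally table
--== Returns 10, 200, 3000, 40000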

    --Jeff Moden



Heh... almost forgot. Now you really know why the term "binary" here is a huge misnomer... the data returned by SQL Server is stored in a binary format (as is all data) and displayed as hexadecimal strings of nibbles. It confused even me when I first started playing with it many years ago. I'd look at it and say "Shoot... THAT's not binary". And, no, it doesn't follow the typical LSB...MSB byte order that you might expect binary data to be stored in, as it would be if you looked at the memory locations holding a typical "double" datatype.
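A one-line, standard T-SQL illustration of that display behavior (my addition):

SELECT CAST(40000 AS BINARY(4)); --== Stored as 4 bytes, displayed as the hex string 0x00009C40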

    --Jeff Moden



  • Ah yes, ok I can work with this. Thanks for the example. I will be getting back at this tomorrow.

    EDIT:

    Jeff Moden (11/21/2011)


Heh... almost forgot. Now you really know why the term "binary" here is a huge misnomer... the data returned by SQL Server is stored in a binary format (as is all data) and displayed as hexadecimal strings of nibbles.

Yes, I was quite confused at first when it was displayed in HEX. I understand this is not how it is stored, however.

  • You bet. I know you said you only needed to provide the data and not do any analysis but, if you ever do, post back because I've already got stuff to solve your "problems" with splitting and using the data in a fairly high performance manner.

As a sidebar, I recommend your end user get a copy of SQL Express (the free version of SQL Server) if they ever need to use the data.

    If they really want something to be spreadsheet friendly, we can certainly take another approach such as TAB delimited data or even importable HTML data.

    --Jeff Moden



  • loki1049 (11/21/2011)


    Ah yes, ok I can work with this. Thanks for the example. I will be getting back at this tomorrow.

    EDIT:

    Jeff Moden (11/21/2011)


Heh... almost forgot. Now you really know why the term "binary" here is a huge misnomer... the data returned by SQL Server is stored in a binary format (as is all data) and displayed as hexadecimal strings of nibbles.

Yes, I was quite confused at first when it was displayed in HEX. I understand this is not how it is stored, however.

    It CAN, however, be transmitted pretty much as you see it.

    --Jeff Moden



  • Jeff Moden (11/21/2011)


    opc.three (11/20/2011)


That said, I am not convinced that a small group of "readings tables" that make up 1024 readings columns would be so bad here (a group of tables is necessary given the 1024-column limit).

    No problem... using that thought, write a query for a given run (assume a paltry 100,000 rows) that will search all 1024 channels on each "sample record" for a maximum value...

Max of a comma-delimited string can be found quite easily in .NET using "Split" to create an array from the comma-delimited list. A SQLCLR object might make sense here if there were a requirement to maintain all code in the data tier; otherwise this could be done before loading the data into the DB and either pushed to the DB directly or used to enrich the flat file prior to bulk loading.

    ...(usually in Lumens, Decibels, or Pascals depending on what is being measured) and then return that value...

    How would the unit of measurement matter in terms of finding a max, min, average and so forth within the same reading? Or were you speaking about the possibility of having to convert from one UOM to another on the fly (i.e. comparing the same column in different rows, not different columns in the same row)?

    ..., the sample number of that "record"...

    I am not sure what you mean by "sample number".

    ...and the channel number...

An unpivot or massive CASE would be needed here... nasty. This one would be easier with a tall data model. This could also be done before, or as, you're loading the data. Yes, this can be easily split using a tally table since you have ready access to the ordinal position (the N), but hey, each method has its pros and cons.
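For illustration, a minimal sketch of that unpivot (my addition; the table and column names are hypothetical, and a real version would list all 512 channels in the VALUES clause):

SELECT d.reading_id, x.channel_no, x.reading
  FROM dbo.reading_detail_1 AS d
 CROSS APPLY (SELECT TOP (1) v.channel_no, v.reading
                FROM (VALUES (1,d.channel_1),(2,d.channel_2),(3,d.channel_3) --== ...(512,d.channel_512)
                     ) AS v(channel_no, reading)
               ORDER BY v.reading DESC) AS x; --== Max reading and its channel, per row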

    ...and you'll understand why I suggested a fixed-field binary format. If that doesn't convince you,...

No question about it... you have more experience in this domain of data; however, I have heard nothing on this thread that says to me that relational principles and tabular storage do not apply. Aside from having to work around the 1024-column limit per table (which is a SQL Server limitation, not a relational-theory limitation), we're still talking about a flat record no matter whether you store the data in typed columns or as a binary "wad".

An item of note: the 1024-column limit pertains to the storage engine. There is no such 1024-column limit on a tabular resultset formed in memory and delivered to a client, meaning we can join two tables of 512+ columns each and return all columns in one resultset.

    ...take the average, min, max, and standard deviation of each channel per second of run time. And it doesn't matter whether you're using SQL Server for the analysis or a GUI. 🙂

I would pre-aggregate as much as I could here. I would either use a secondary table (a "stats" table of sorts) that was populated incrementally some time after the data arrived or, if a real-time feel were needed, have that calculation logic activated by a trigger. Persisted computed columns could also come in handy.

Say you made 512 (a nice round number, at least in computer science) your column limit for a single table. I see 4 tables:

dbo.reading_master:   reading_id (surrogate key), device_id, run date/time
dbo.reading_detail_1: reading_id, channel 1 reading, ..., channel 512 reading
dbo.reading_detail_2: reading_id, channel 513 reading, ..., channel 1024 reading
dbo.reading_stat:     reading_id, average reading*, min reading*, max reading*, reading stdev*

* populated by stored procedures activated by 1) a trigger on one of the reading_detail tables, or 2) an auxiliary process on a schedule, or 3) the "loader process" after readings are completely loaded, or n) I am sure there are other ways to do this efficiently (a minimal trigger sketch follows below).
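To make option 1 concrete, here is a minimal sketch of such a trigger (my addition; names are illustrative, and a real version would list all 512 channels in the VALUES clause):

CREATE TRIGGER trg_reading_detail_1_stat ON dbo.reading_detail_1
AFTER INSERT AS
BEGIN
    SET NOCOUNT ON;
    --== Unpivot each new row's channel columns and aggregate them.
    INSERT dbo.reading_stat (reading_id, avg_reading, min_reading, max_reading, reading_stdev)
    SELECT i.reading_id, AVG(v.reading), MIN(v.reading), MAX(v.reading), STDEV(v.reading)
      FROM inserted AS i
     CROSS APPLY (VALUES (i.channel_1),(i.channel_2),(i.channel_3) --== ...(i.channel_512)
                 ) AS v(reading)
     GROUP BY i.reading_id;
END;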

After writing all this and mentioning SQLCLR, I started thinking that it may make sense to create a SQLCLR Type to represent our "spectrometer reading". This type could have a method that returned any channel's reading given the channel number. It could also have methods to return useful aggregates like max, min, stdev, etc., and the results of those methods could be persisted and indexed. I am not sure if any or all of the methods on the CLR type could be set up such that they would be considered deterministic by the engine, so a persisted computed column may not be an option, but at the very least the results of those methods could be manually stored and indexed in a concrete column. That would quell my "typed columns" concern and would put some structure around the "wad" of binary data (SQLCLR types are stored as binary under the covers: a serialized version of the CLR object itself with all its properties).

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (11/22/2011)


    Max of a comma-delimited string can be found quite easily in .NET using "Split" to create an Array from a comma-delimited list

    Precisely my point. And it would be even faster using (say) 4 or 5 bytes for every element because it wouldn't have to search for a delimiter. Try doing the same elegant thing with 1024 individual columns even if they fit in one table.

So far as the UOMs go... it was only to identify that the max would be relatively small in terms of how big the numbers may be. For the radiation levels the OP is talking about, it ends up as a max of 50,000. If it were dB, it could easily be as small as one byte (0-255 max) per channel.

No question about it... you have more experience in this domain of data; however, I have heard nothing on this thread that says to me that relational principles and tabular storage do not apply

I've not suggested otherwise. People just aren't used to "relational principles" and "tabular storage" being applied to what used to be the original format for databases... fixed-field strings. 🙂 If you look at an actual page of, say, INTEGERs, you won't find much difference between the binary blob I propose and a page of INTEGERs. It's just easier to work with the blobs than with raw page data in SQL Server.

In fact, now that I've said that, I'm thinking there's no reason for anything in SQL Server and not much for an app to do. Just capture the output of the spectrometer in a file with a run header and fixed-field line (sample) numbers (each line of output is called a "sample", as well as other names).

    --Jeff Moden



  • Jeff Moden (11/22/2011)


    opc.three (11/22/2011)


    Max of a comma-delimited string can be found quite easily in .NET using "Split" to create an Array from a comma-delimited list

    Precisely my point. And it would be even faster using (say) 4 or 5 bytes for every element because it wouldn't have to search for a delimiter. Try doing the same elegant thing with 1024 individual columns even if they fit in one table.

I understand your approach and I agree it is faster to ditch the delimiter and use fixed widths within your binary column. Naturally the tally-table split plays right into that. BTW, I did try taking the max of 1024 columns in a single row... a nasty proposition. That is why I said I would cheat as much as possible and pre-aggregate max, min, avg, stdev, etc., maybe even leveraging whatever loader process was being developed, which led me to the .NET array-split comment 😀

No question about it... you have more experience in this domain of data; however, I have heard nothing on this thread that says to me that relational principles and tabular storage do not apply

I've not suggested otherwise. People just aren't used to "relational principles" and "tabular storage" being applied to what used to be the original format for databases... fixed-field strings. 🙂 If you look at an actual page of, say, INTEGERs, you won't find much difference between the binary blob I propose and a page of INTEGERs. It's just easier to work with the blobs than with raw page data in SQL Server.

Amen to not working directly with pages, and I agree with you that fixed-field strings aren't the first thing you think of when data modeling for an RDBMS 🙂 It did not enter my thoughts at all, but after your post I remembered that in college I had to write a C-based database where the persistent storage was a text file... fseek... grrr... then after that torturous experience they redeemed themselves and issued me logins to shared Oracle and SQL Server instances.


So looking at this more, it seems we have to store it as a comma-delimited string in an NVARCHAR(MAX) column due to the string's length*. Anyhow, it's a similar scenario because we are storing a huge 'BLOB' of data and then need some way to parse it. I have tried to split the string, which will be in the form of "X.XXEXX,X.XXEXX,X.XXEXX,", but a split function like DelimitedSplit8K will not work easily due to the 8K limit on this splitter. I have tried something like the following with Jeff's (and other contributors') latest version of DelimitedSplit8K. (Note also that the number of channels has increased from 1024 to 2048; I had my facts wrong.)

DECLARE @StringToSplit NVARCHAR(MAX)

--== Create test string to split.
--== NOTE: the literal must be cast to NVARCHAR(MAX) first or REPLICATE
--==       silently truncates the result at 4,000 characters.
SELECT @StringToSplit = REPLICATE(CONVERT(NVARCHAR(MAX),'5.00E04,'),2048)

--== Remove the trailing delimiter
SELECT @StringToSplit = LEFT(@StringToSplit,LEN(@StringToSplit)-1)

--== Format data in tall table format
SELECT ChannelNumber  = ItemNumber,
       RawChannelData = Item,
       Counts         = CAST(LEFT(Item,4) AS FLOAT) * POWER(10,CAST(RIGHT(Item,2) AS FLOAT))
  FROM dbo.DelimitedSplit8K(@StringToSplit,',')

    The DelimitedSplit8k function can be found here: http://qa.sqlservercentral.com/articles/Tally+Table/72993/

The only way I can seem to make this work is to use a bunch of NVARCHAR(4000) strings and split the large string amongst them, then use the split function on each of the small strings and UNION the results together while modifying the ItemNumber of each query to keep the ChannelNumber in sync (otherwise the unions will duplicate the channel numbers); a sketch of that idea follows below. I might have to look into this CLR business. I assume this is a way to write the equivalent of a .NET language in a SQL construct?
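For illustration, a minimal sketch of that chunk-and-offset idea (my addition; it assumes every element is exactly 8 characters, "X.XXEXX," with its delimiter, so chunk boundaries can safely fall at multiples of 8):

DECLARE @StringToSplit NVARCHAR(MAX) =
        REPLICATE(CONVERT(NVARCHAR(MAX),'5.00E04,'),2048);
DECLARE @ChunkSize INT = 4000; --== 500 eight-character elements per chunk

SELECT ChannelNumber  = (c.ChunkNo - 1) * (@ChunkSize / 8) + s.ItemNumber,
       RawChannelData = s.Item
  FROM (VALUES(1),(2),(3),(4),(5)) AS c(ChunkNo) --== 5 chunks cover 2048 elements
 CROSS APPLY dbo.DelimitedSplit8K(
          SUBSTRING(@StringToSplit,(c.ChunkNo - 1) * @ChunkSize + 1,@ChunkSize)
         ,',') AS s
 WHERE s.Item > '' --== Discard the empty item after each chunk's trailing comma
 ORDER BY ChannelNumber;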

*The reason we can't store it as a binary blob is that the data is being sent to the database over a SOAP message to a web service, and I am currently not sure what form to send the data to the web service in, as a SOAP request usually carries strings and ints. This might get sorted out in the future.

  • loki1049 (11/25/2011)


I assume this is a way to write the equivalent of a .NET language in a SQL construct?

That's pretty much the gist of it... SQLCLR allows us to implement various SQL Server objects using a CLR language. At present, CLR 2.0 is supported by default in 2005, 2008 and 2008 R2; SQL Server 2012 introduces support for CLR 4.0. A SQLCLR object presents a T-SQL interface to all other database objects but is actually implemented in a CLR-compliant language. Generically, some things you can implement using SQLCLR are user-defined types, scalar functions, table-valued functions, aggregates and stored procedures. A concrete example is the Geography type added in SQL Server 2008, but, per Microsoft, SQLCLR is a platform for you to extend the T-SQL programming domain.

In my previous post, my mention of SQLCLR was pointing towards creating a SQLCLR Type, let's call it "SpectrometerReading", whose constructor accepts whatever is coming from your data supplier, whether that be a comma-separated list or something else. Internally, the data provided to the constructor would be represented as a "Collection" (maybe a Dictionary<int,decimal>). Then, properties of the SpectrometerReading Type would offer things like MinReading, MinChannel, MaxReading, MaxChannel, StDev, etc. that operate on the internal Collection of readings.
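For orientation, the T-SQL side of wiring up such a type might look like the following minimal sketch (my addition; the assembly path, type name and property names are illustrative, not taken from any attachment):

--== One-time setup (requires CLR to be enabled on the instance)
CREATE ASSEMBLY SpectrometerReadingAsm
  FROM 'C:\clr\SpectrometerReading.dll'; --== Hypothetical path

CREATE TYPE dbo.SpectrometerReading
  EXTERNAL NAME SpectrometerReadingAsm.[SpectrometerReading];

--== Usage: assigning a string invokes the type's Parse method
DECLARE @r dbo.SpectrometerReading = '5.00E04,4.90E04,1.20E03';
SELECT @r.MaxReading, @r.MaxChannel; --== Hypothetical properties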

Note: the comments in the following paragraph are only theoretical. I have worked with SQLCLR a fair bit, but not in this capacity specifically, and delivering the results as a tabular resultset is new territory for me... but I was thinking of creating a CLR table-valued function that accepted one parameter of type SpectrometerReading and returned the Collection as a single row with 1024 columns. You could then pull out the rows from a column of type SpectrometerReading in an entire table using CROSS APPLY.

    I had planned on working up a POC of the aforementioned SQLCLR Type and Table-valued function even before you posted, just out of my own curiosity. If I get to it I will post my findings here. Please do the same if you attempt to use SQLCLR.


  • loki1049 (11/25/2011)


I have tried to split the string, which will be in the form of "X.XXEXX,X.XXEXX,X.XXEXX,", but a split function like DelimitedSplit8K will not work easily due to the 8K limit on this splitter.

You said the domain of numbers produced by the spectrometer would be from 0 to 50,000, so I don't understand why you need the relative complexity of scientific notation. 50,000 will easily fit in 3 bytes; 2,048 times 3 is only 6,144 bytes, and it can all be stored without the complexity of formatting or delimiters.
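For illustration, a minimal sketch of that 3-bytes-per-count idea (my addition, not from the post):

DECLARE @Count INT = 50000;
DECLARE @Packed BINARY(3) = CAST(@Count AS BINARY(3)); --== 0x00C350
SELECT Unpacked = CAST(0x00 + @Packed AS INT);         --== Left-pad to 4 bytes => 50000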

If you insist on using formatted, delimited data to store fixed-width data, then use Paul White's CLR splitter, which is also available in the article previously cited. It will split on a single delimiter in strings up to ~2 billion bytes.

    --Jeff Moden



  • Jeff Moden (11/26/2011)


    loki1049 (11/25/2011)


I have tried to split the string, which will be in the form of "X.XXEXX,X.XXEXX,X.XXEXX,", but a split function like DelimitedSplit8K will not work easily due to the 8K limit on this splitter.

You said the domain of numbers produced by the spectrometer would be from 0 to 50,000, so I don't understand why you need the relative complexity of scientific notation. 50,000 will easily fit in 3 bytes; 2,048 times 3 is only 6,144 bytes, and it can all be stored without the complexity of formatting or delimiters.

If you insist on using formatted, delimited data to store fixed-width data, then use Paul White's CLR splitter, which is also available in the article previously cited. It will split on a single delimiter in strings up to ~2 billion bytes.

If I were in a position to make the decision, I would definitely store it as a binary BLOB, but I'm not. As often happens, management is making decisions in the realm of the unknown and is stubborn. I guess they are afraid of placing a constraint on the project by limiting the magnitude of the count readings in any logical way, hence the desire for the extremely large scientific representation of counts.

The other reason is my internal opinion on how to get this data to the database. We are using a SOAP protocol, which is ideal for sending strings and ints. Sending a binary stream of data must be done with attachment-style encoding, which is beyond my current knowledge, and time is running out on this project. I suppose I could send a string of ones and zeros and convert it into a byte array in the web service, but that seems less than ideal. I believe working with a CLR is going to be the easier way.
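One alternative worth noting (my sketch, not from the thread): the binary wad can travel over SOAP as a plain hex string and be converted back server-side with CONVERT style 2, available from SQL Server 2008 onward:

DECLARE @Hex VARCHAR(MAX) = '0000000A000000C800000BB800009C40'; --== As received over SOAP
SELECT CONVERT(VARBINARY(MAX), @Hex, 2); --== Style 2 = hex string without a '0x' prefix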

I went ahead and created a couple of different CLR types. This being my first try in this area, I was not sure whether calculating everything at the outset and storing/serializing those values would be faster than recalculating the values each time they were needed (i.e. selected). As it turns out, serialization, deserialization, or a combination of both adds enough overhead that it makes sense to recalculate the values (e.g. Sum, Max, Avg, StdDev, etc.) each time they are needed. The two objects are named SpectrometerReadingFrontLoaded and SpectrometerReadingLazy. Have they been exhaustively tested? No, by no means, but they should be a good start. By the way, I went away from the table-valued UDF idea because SQL Server balked when the resultset exceeded 1024 columns. As I mentioned earlier, ad hoc queries can have more than 1024 columns, and as I recently also learned, so can queries returned from a stored procedure.

    Attached is SpectrometerReading.zip containing the following items of note:

    1. Directory "\SpectrometerReading" contains the SQL CLR Project developed in VS2010+SP1

    2. Directory "\SpectrometerReading\Test Scripts" contains SQL scripts to build up some test data and exercise the CLR Types.

    3. File "\Install SpectrometerReading CLR Objects.sql" allows you to install the CLR objects to a database without VS 2010 or any C# experience. The binary CLR objects are serialized within the SQL script as ANSI text so they can be compiled into SQL Server as plain old T-SQL.

    If you have any questions or comments about the code or otherwise make any progress towards a solution please post back.

