Gaps and islands sql : Condense upstream data

  • Jacob Wilkins (4/29/2016)


    J Livingston SQL (4/29/2016)


    Hi Jacob.....

    on a larger data set I am getting

    A TOP or FETCH clause contains an invalid value.

    any ideas please?

    Ah, most likely you have a NULL for the start or end date. Just have to modify to handle NULLs appropriately.

    Cheers!

    bugger..test data script had NULLs in...thanks muchly.

    moving on for testing purposes....am I correct in thinking if my test data has "date/time" and not integers...I can use your code as follows:

    ---ignore previous conversion

    WITH

    covered_dates AS

    (

    SELECT *

    FROM tempdata

    CROSS APPLY

    (

    SELECT TOP(DATEDIFF(dd,StartDateKey,EndDateKey)+1)

    covered_date=DATEADD(dd,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,StartDateKey)

    FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n1(n) --An on-the-fly numbers table

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n2(n)

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n3(n)

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n4(n)

    )x

    ),

    distinct_numbered_covered_dates AS

    (

    SELECT *,RN=ROW_NUMBER() OVER (PARTITION BY BusName ORDER BY covered_Date ASC)

    FROM (

    SELECT DISTINCT BusName,BusType,covered_date

    FROM covered_dates

    )x

    )

    SELECT BusName,

    BusType,

    StartDateKey=MIN(covered_date),

    EndDateKey=MAX(covered_date)

    FROM distinct_numbered_covered_dates

    GROUP BY BusName,BusType,DATEADD(dd,-RN,covered_date)

    ORDER BY BusName,

    BusType,

    StartDateKey

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • my test script......responses appreciated !

    -- create some test data

    USE tempdb

    GO

    IF OBJECT_ID('tempdb..tempdata', 'U') IS NOT NULL

    DROP TABLE tempdb..tempdata ;

    WITH ctetrans as (

    SELECT TOP 5000000

    BusName = CHAR(Abs(Checksum(Newid())) % 26 + 65),

    BusType = CHAR(Abs(Checksum(Newid())) % 10 + 65)

    + CHAR(Abs(Checksum(Newid())) % 10 + 65),

    TransDate = Dateadd(dd, Abs(Checksum(Newid())) % Datediff(dd, '2012', '2016'), '2012')

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    CROSS JOIN sys.all_columns ac3

    )

    ,

    ctetempdata as (

    SELECT BusName,

    BusType,

    TransDate AS StartDateKey,

    lead(transdate, 1, transdate ) OVER(PARTITION BY BusName, BusType ORDER BY TransDate) AS EndDatekey,

    ROW_NUMBER() OVER(PARTITION BY BusName, BusType ORDER BY TransDate) % 2 rnselect

    FROM ctetrans

    )

    SELECT BusName,

    BusType,

    StartDateKey,

    EndDatekey

    INTO tempdata

    FROM ctetempdata

    WHERE rnselect = 0

    -- see some testdata

    SELECT COUNT(*) FROM tempdata;

    SELECT BusName, BusType, StartDateKey, EndDatekey

    FROM tempdata

    WHERE (StartDateKey IS NULL) OR (EndDatekey IS NULL)

    SELECT BusName,

    BusType,

    StartDateKey,

    EndDatekey

    FROM tempdata

    WHERE(BusName = 'A') AND (BusType = 'AA')

    ORDER BY BusName, BusType, StartDateKey;

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • If the time matters (say, you have multiple rows per day), then I don't think this approach will work well.

    If you only care about date, then it works pretty well, but it does rely on only the date mattering.

    Cheers!

  • Jacob...dont seem to get the results expected from your code in comparison to mine.....have I misread your code:

    -- create some test data

    USE tempdb

    GO

    IF OBJECT_ID('tempdb..tempdata', 'U') IS NOT NULL

    DROP TABLE tempdb..tempdata ;

    WITH ctetrans as (

    SELECT TOP 500

    BusName = CHAR(Abs(Checksum(Newid())) % 2 + 65),

    BusType = CHAR(Abs(Checksum(Newid())) % 10 + 65)

    + CHAR(Abs(Checksum(Newid())) % 10 + 65),

    TransDate = CAST(Dateadd(dd, Abs(Checksum(Newid())) % Datediff(dd, '2012', '2013'), '2012') AS DATE)

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    CROSS JOIN sys.all_columns ac3

    )

    ,

    ctetempdata as (

    SELECT BusName,

    BusType,

    TransDate AS StartDateKey,

    lead(transdate, 1, TransDate) OVER(PARTITION BY BusName, BusType ORDER BY TransDate) AS EndDatekey,

    ROW_NUMBER() OVER(PARTITION BY BusName, BusType ORDER BY TransDate) % 2 rnselect

    FROM ctetrans

    )

    SELECT BusName,

    BusType,

    StartDateKey,

    EndDatekey

    INTO tempdata

    FROM ctetempdata

    WHERE rnselect = 0

    -- see some testdata

    SELECT COUNT(*) FROM tempdata;

    --====================================================================================

    -- Jacob solution

    WITH

    covered_dates AS

    (

    SELECT *

    FROM tempdata

    CROSS APPLY

    (

    SELECT TOP(DATEDIFF(dd,StartDateKey,EndDateKey)+1)

    covered_date=DATEADD(dd,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,StartDateKey)

    FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n1(n) --An on-the-fly numbers table

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n2(n)

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n3(n)

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n4(n)

    )x

    ),

    distinct_numbered_covered_dates AS

    (

    SELECT *,RN=ROW_NUMBER() OVER (PARTITION BY BusName ORDER BY covered_Date ASC)

    FROM (

    SELECT DISTINCT BusName,BusType,covered_date

    FROM covered_dates

    )x

    )

    SELECT BusName,

    BusType,

    StartDateKey=MIN(covered_date),

    EndDateKey=MAX(covered_date)

    FROM distinct_numbered_covered_dates

    GROUP BY BusName,BusType,DATEADD(dd,-RN,covered_date)

    ORDER BY BusName,

    BusType,

    StartDateKey;

    -- JLS solution

    WITH cte as (

    SELECT BusName,

    BusType,

    StartDateKey,

    EndDatekey,

    CASE

    WHEN DATEDIFF(day, LAG(enddatekey, 1, StartDateKey) OVER(PARTITION BY BusName, BusType ORDER BY Startdatekey ), StartDateKey) < 2

    THEN NULL

    ELSE ROW_NUMBER() OVER(PARTITION BY BusName, BusType ORDER BY Startdatekey )

    END sortkey,

    ROW_NUMBER() OVER(PARTITION BY BusName, BusType ORDER BY Startdatekey ) rn

    FROM tempdata

    )

    ,

    cte2 as (

    SELECT BusName,

    BusType,

    StartDateKey,

    EndDatekey,

    CAST(

    SUBSTRING(

    MAX( CAST(rn AS BINARY(4)) + CAST(sortkey AS BINARY(4)) )

    OVER( PARTITION BY BusName, BusType ORDER BY rn ROWS UNBOUNDED PRECEDING ),

    5, 4)

    AS INT) AS lastval

    FROM cte

    )

    SELECT BusName,

    BusType,

    MIN(StartDateKey) AS sdate,

    MAX(EndDatekey) AS edate

    FROM cte2

    GROUP BY BusName,

    BusType,

    lastval

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (4/29/2016)


    Jacob...dont seem to get the results expected from your code in comparison to mine.....have I misread your code:

    You didn't misread it, but the sample data doesn't conform to a rule that the OP's data did (and I assumed in my solution), which is that a given bus can be of just one type. Just partition the ROW_NUMBER by bus type in addition to bus name and they should match.

    I'll have to confirm later.

    Cheers!

  • Jacob Wilkins (4/29/2016)


    J Livingston SQL (4/29/2016)


    Jacob...dont seem to get the results expected from your code in comparison to mine.....have I misread your code:

    You didn't misread it, but the sample data doesn't conform to a rule that the OP's data did (and I assumed in my solution), which is that a given bus can be of just one type. Just partition the ROW_NUMBER by bus type in addition to bus name and they should match.

    I'll have to confirm later.

    Cheers!

    got it !...nice solution

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (4/29/2016)


    Jacob Wilkins (4/29/2016)


    J Livingston SQL (4/29/2016)


    Jacob...dont seem to get the results expected from your code in comparison to mine.....have I misread your code:

    You didn't misread it, but the sample data doesn't conform to a rule that the OP's data did (and I assumed in my solution), which is that a given bus can be of just one type. Just partition the ROW_NUMBER by bus type in addition to bus name and they should match.

    I'll have to confirm later.

    Cheers!

    got it !...nice solution

    Thanks! Yours works pretty nicely as well. It tends to perform better than my initial one in terms of CPU and duration.

    I have a feeling they would be much closer if instead of doing the lazy DISTINCT subquery I either 1) did a ROW_NUMBER partitioned by all columns in the subquery and added a WHERE row_num=1 to get rid of duplicates, or 2) skipped the subquery altogether and just replaced the ROW_NUMBER with DENSE_RANK, since then duplicates wouldn't matter.

    I'll have to test that out in a bit.

    Cheers!

  • to the OP

    you have marked my code as correct but havent posted any comments?

    would be interested to know how you are getting on. ??

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Tons of Thanks to you , in fixing and finding solution . But I am not sure , how it works on production data . As of now , I hold small data in development.

    Work around to improve performance :

    Run the logic over 100 Buses in a batch ,as logic is to condense records for each bus .

    But I will inform, how I deployed in production .

    Thanks

    Surya Sunil

  • sunil.mvs (5/2/2016)


    Tons of Thanks to you , in fixing and finding solution . But I am not sure , how it works on production data . As of now , I hold small data in development.

    Work around to improve performance :

    Run the logic over 100 Buses in a batch ,as logic is to condense records for each bus .

    But I will inform, how I deployed in production .

    Thanks

    Surya Sunil

    I did post some test scripts earlier on in thus thread....have you tried them?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply