Aggregate Query question

  • >>Do you want to have dup districts? Means can a tupple be like this 'dist1,dist1,dist1' = 300 if dist1 has 100 rows?

    No. All old districts must be included, but distinct in the new set.

    >>Cross Join

    FWIW, along with Celko's help, I solved it using negation, then cross-joining the three resulting table sets (Unique, doubles, triples). We're just curious about being able to solve it more elegantly using the newer SQL functions.

    I'm going to spend some time taking a look at the Cube implications though...

    Thx,

    ~Les

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • Cubes solution is just like your unique ,doubles,tripples but it can have more than tripple like say 30 column combinations and thus in the end will need less tables for cross joins based on the number of distinct values.

    But you might need to write some dynamic code or if the values are known and smaller than you can just write it yourself :). But for that you need to understand the grouping_id() function which determines which columns/values are summed.

    GulliMeel

    Finding top n Worst Performing queries[/url]
    Improve the performance of Merge Join(special case)
    How to Post Performance Problem -Gail Shaw[/url]

  • dwain.c (6/30/2012)


    Sorry but I do know why the CTE returns 0 rows. You must still have the population equality check on the last line. Try taking that out.

    You'll find that the reason it only takes 7 minutes to run is the limiter I put in (n <= 3 I think it was). To make it solve the problem, you'd need to take that out. But I'd suggest just incrementing the 3 by 1 a couple of times and watch the impact it has on the run times before taking it out entirely.

    That was it. I made those changes and it iterated out to 153k rows, with 1 and 2 district combination of up to 4 districts. I'm now testing 6.

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • Dwain,

    LOL... after 2hrs, it tossed...

    "An error occurred while executing batch. Error message is: Exception of type 'System.OutOfMemoryException' was thrown."

    ...on a server with 64gb RAM. Last I looked after 30min before leaving, it was up to 10mill rows.

    What's needed is to impose 'implicit negation', or the enumeration becomes too much for in-memory resolution. The problem with stated negation is its lack of elegance (and a lot of redundant code). Hence my search for a multidimensional key for this thing. Where a MK is derived from 2n+ domains in a co-domain, this is problematic for this type of problem because even after resolving Candidates, the litmus is the total population of Sweden (abstract domain), and uniqueness of old districts within the new districts. Each of the Candidates are a subset of that domain, but the parent key (population of Sweden) is a composite. I do think recursion through a CTE or another may provide the answer, and it's probably an atomic reduction to the simple, but it is elusive.

    That said, I think if we calculate the doubles in the same way D1 was derived (and triples per T1), the popultation litmus for those subsets would mitigate the enumeration count in the same way SW7 and SW8 did in the negation solution. Just a thought.

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • Hey Les! Just wanted you to know that I haven't forgotten about you and Sweden.

    Here's a progress report. I have a single, cascading CTE that generates the following:

    1. 24 districts combined in 8 triplets

    2. 20 districts combined in 10 doublets

    3. 11 districts that could participate in a doublet or a triplet that were eliminated

    4. 11 districts that cannot participate in a doublet or a triplet due to exclusions of map or population

    After running for about 29 minutes it spewed out 7776 combinations that I believe are unique.

    I still need to "fill in the missing districts." Point 4 I already figured out a couple of days ago. Point #3 not yet certain how to do that but I have some ideas.

    Now a question to you. I presume the optimal solution to this problem results in as many districts participating in a doublet or triplet as possible. Theoretically this would be 55 (excludes #4). What is the best solution that you were able to come up with?

    I suspect that once I examine the results produced by the query now running, I may have some ideas how to eek out a few extra doublets or triplets and improve the results set, hopefully without too much added computational cost.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Let me share with you my current thought pattern to see if you have any comment.

    I mentioned in my prior post (point #4) that "11 districts that cannot participate in a doublet or a triplet due to exclusions of map or population." The following query lists these out.

    ;WITH Singles AS (

    -- The original idea was to convert each district into a VARCHAR with leading zeroes,

    -- however ultimately that was two costly on comparisons later so this CTE can probably be removed

    -- with a very slight boost to performance (just use the Sweden table instead of Singles)

    SELECT district

    ,d1=RIGHT('00'+CAST(district AS VARCHAR), 2)

    ,[population]

    FROM Sweden),

    Doubles AS (

    -- This CTE creates (from the Map) all possible combined districts with their populations

    SELECT m.district, neighbor

    ,d1=m.district -- RIGHT('00'+CAST(m.district AS VARCHAR), 2)

    ,n1=neighbor -- RIGHT('00'+CAST(neighbor AS VARCHAR), 2)

    ,[population]=SUM([population])

    FROM Map m

    INNER JOIN Sweden s

    ON m.district = s.district or m.neighbor = s.district

    GROUP BY m.district, neighbor),

    Combine2and3 ( n, d1, d2, d3, dall, [Population]) AS (

    -- This recursive CTE (one execution only of the recursive leg) combines the two/three

    -- district combinations

    SELECT 2

    ,district -- d1

    ,neighbor -- n1

    ,0 -- '00'

    ,CAST(d1 AS VARCHAR(8000)) + ',' + CAST(n1 AS VARCHAR(8000))

    ,[population]

    FROM Doubles

    WHERE population < 6000

    UNION ALL

    SELECT 1 + n.n

    ,y.d1

    ,y.d2

    ,y.d3

    ,CASE WHEN d < n.d1 THEN CAST(d AS VARCHAR(8000))+ ',' + CAST(n.dall AS VARCHAR(8000))

    WHEN d > n.d2 THEN CAST(n.dall AS VARCHAR(8000)) + ',' + CAST(d AS VARCHAR(8000))

    ELSE CAST(n.d1 AS VARCHAR(8000)) + ',' + CAST(d AS VARCHAR(8000)) + ',' + CAST(n.d2 AS VARCHAR(8000)) END

    ,n.[population] + s.[population]

    FROM Combine2and3 n

    INNER JOIN Doubles d ON (d.d1 IN (n.d1, n.d2) or d.n1 IN (n.d1, n.d2))

    CROSS APPLY (

    SELECT CASE WHEN d.d1 IN (n.d1, n.d2) THEN d.n1 ELSE d.d1 END) x(d)

    INNER JOIN Singles s ON s.d1 = d

    -- To support later steps, we'll keep the districts sorted within the tuple

    CROSS APPLY (

    SELECT d1=CASE WHEN d < n.d1 THEN d WHEN d > d2 THEN n.d1 ELSE n.d1 END

    ,d2=CASE WHEN d < n.d1 THEN n.d1 WHEN d > d2 THEN n.d2 ELSE d END

    ,d3=CASE WHEN d < n.d1 THEN n.d2 WHEN d > d2 THEN d ELSE n.d2 END) y

    WHERE n < 3 AND y.d1 <> y.d2 AND y.d2 <> y.d3

    ),

    NewDistricts AS (

    -- Finally NewDistricts converts the above to a unique set of (90+) doubles and triples

    -- for use later. Up through here, all of these steps together are quite fast.

    SELECT n, d1, d2, d3, dall, [population]

    FROM (

    SELECT n, d1, d2, d3, dall, [population]

    ,rn=ROW_NUMBER() OVER (PARTITION BY n, d1, d2, d3 ORDER BY (SELECT NULL))

    FROM Combine2and3

    WHERE [population] BETWEEN 5900 AND 6000) x

    WHERE rn=1),

    SpecialDistricts AS (

    SELECT d56=SUM([1]), d57=SUM([2]), d58=SUM([3]), d59=SUM([4]), d60=SUM([5]), d61=SUM([6])

    ,d62=SUM([7]), d63=SUM([8]), d64=SUM([9]), d65=SUM([10]), d66=SUM([11])

    ,[population]=SUM([population])

    FROM (

    SELECT n=ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), y.district, [population]

    FROM (

    -- Special districts are not included in the NewDistricts due to population limits

    -- and map constraints (there are exactly 11)

    SELECT district FROM Sweden EXCEPT

    SELECT d1

    FROM (

    SELECT d1 FROM NewDistricts

    UNION ALL SELECT d2 FROM NewDistricts

    UNION ALL SELECT d3 FROM NewDistricts) x) y

    INNER JOIN Sweden s ON s.district = y.district) ps

    PIVOT (MAX(district)

    FOR n IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11])) pt)

    SELECT *

    FROM SpecialDistricts

    The results (pivoted for a reason) are:

    d56 d57 d58 d59 d60 d61 d62 d63 d64 d65 d66 population

    2 9 11 24 26 35 40 47 48 58 63 27161

    I don't think that my current "packing" of the triplets and doublets into the final solution. But I'm thinking about packing them based on the number of occurrences of each district in a doublet or triplet by using the next query to count these occurrances.

    -- <snip> The same cascaded CTE to calculate NewDistricts as above

    SELECT d, [Count]=SUM([Count])

    FROM NewDistricts

    CROSS APPLY (VALUES (d1, 1),(d2, 1),(d3, 1)) x(d, [Count])

    WHERE d <> 0

    GROUP BY d

    ORDER BY SUM([Count]), d

    I'll let you run that one yourself but I'm thinking that if I pack them in from smallest count to largest count, the smaller counts are less likely to get eliminated on subsequent packing passes. Not only that, the number of n-tuples combined on the earlier passes should result in significantly fewer result rows, at least in the earlier passes.

    This should be fairly easy to accomplish later on today. I'll post results when available.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (7/2/2012)


    Here's a progress report. I have a single, cascading CTE that generates the following:

    1. 24 districts combined in 8 triplets

    2. 20 districts combined in 10 doublets

    3. 11 districts that could participate in a doublet or a triplet that were eliminated

    4. 11 districts that cannot participate in a doublet or a triplet due to exclusions of map or population

    After running for about 29 minutes it spewed out 7776 combinations that I believe are unique.

    My latest is (single cascaded CTE):

    1. 24 districts combined in 8 triplets

    2. 22 districts combined in 11 doublets

    3. 9 districts that could participate in a doublet or a triplet that were eliminated

    4. 11 districts that cannot participate in a doublet or a triplet due to exclusions of map or population

    This runs in under 3 seconds and generates a single row result.

    When you tell me my target packing factor, I'll try some more to see if I can attain it.

    Jeff - If you're still listening, I call the solution the "ultimate, amalgamated, conglomerated, cascading CROSS APPLY" solution. :w00t: You'd like it. 😀


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Hi Dwain,

    You're getting close.

    If you first isolate those districts from Candidates where the Old District only appears once (see SW4.doc), they can then be eliminated from the rest. There are 12 old districts that fall into that bucket (NewDistricts table) which make up 6 New Districts, which are implicitly distinct and result in only 1 unique set for these 12...

    d1 d2 d3 pop

    3 52 0 5999

    6 13 0 5997

    9 26 0 6013

    35 63 0 6053

    48 0 0 5976

    58 47 2 6080

    Your set is partially correct, but compare it to this set as a litmus.

    At the bottom of SW5, I check for doubles/triples that might exist in both, against this set. There is only 1, which is then deleted from the Remaining candidates.

    From Candidates, if you then isolate those into Doubles and Triples (SW6.doc), you end up with...

    24 Old Districts in Doubles (D1)

    30 Old Districts in Triples (T1)

    Since there are 24 old districts in doubles, divided by 2 per New District, it resolves to 12 new (possible) districts.

    As well, 30 old districts in triples, divided by 3 per new district, resolves to 10 new (possible) districts.

    When projected...

    ...the doubles (D2) create 58 distinct sets of doubles whose population = D1.

    ...the triples (T2) create 162 distinct sets of triples whose population = T1.

    Since those districts that were eliminated first create only 1 result set, the total possible number of solutions to the Sweden problem is...

    1x58x162=9396

    ~Les

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • In your original spec you stated the population set requirement:

    Les Cardwell (6/22/2012)

    Sweden Challenge...

    * Total 'old' number of Districts was 66.

    * Total Sweden Population = 167,976.

    * New District population set requirement = between 5,900 and 6,000 each.

    * Each New District had to be created using neighbors of Old Districts.

    * Approx. number of 'new' districts = 11 or 12 = ((167,976 / 5,900) / 2.5 old districts to each new district)

    * Possible Candidates after resolving 'neighbors' for the redistricting = 151 new districts, each containing 1, 2, or 3 'old' districts (based on population requirements).

    ~Les

    But according to your latest description of those specific sets (and as it appears in your SW3) the limitation is:

    BETWEEN 5900 AND 6100

    Shame on you for throwing me off the scent! I have a bit of rework to do.

    I was hoping for a little encouragement on my 3 sec. run times. Are they in the ballpark?


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Yesterday , I got a chance to read the thread fully. The cube method wont work as mentioned because you have some kind of mapping between districts but cube method will provide you all perm and comb..But you can filter them out. That method is going to bit lengthy because there are 66 districts.

    Could you please post the final output, which you expect?

    GulliMeel

    Finding top n Worst Performing queries[/url]
    Improve the performance of Merge Join(special case)
    How to Post Performance Problem -Gail Shaw[/url]

  • After revising my population range check to BETWEEN 5900 AND 6100, I now get results like this:

    My latest is (single cascaded CTE):

    1. 30 districts combined in 10 triplets

    2. 22 districts combined in 11 doublets

    3. 7 districts that could participate in a doublet or a triplet that were eliminated

    4. 7 districts that cannot participate in a doublet or a triplet due to exclusions of map or population

    And the code still runs in less than 3 seconds to produce this set of combined districts:

    -- Doublets

    [3,52],[6,13],[8,66],[43,59],[1,12],[25,55],[27,57],[31,36],[10,46],[42,64],[23,32],

    -- Triplets

    [2,47,58],[22,49,50],[7,14,24],[15,19,37],[5,56,61],[21,30,41],[4,38,54],[17,18,53],[20,29,39],[16,34,44],

    -- Eliminated districts

    [28],[33],[45],[51],[60],[62],[65],

    -- Ineligible districts

    [9],[11],[26],[35],[40],[48],[63]

    While not the optimal solution (yet), I think I should get some points for speed. 😀

    It appears that I'm off the optimal solution because [9,26] (population 6013) are coming up in my ineligible list. If I can resolve that, I think I'll be there.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (7/3/2012)

    It appears that I'm off the optimal solution because [9,26] (population 6013) are coming up in my ineligible list. If I can resolve that, I think I'll be there.

    Figured out why! Now to reconstruct my scenario (in about 8 hours).


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Dwain, Just a small point

    Why do you need CTE because this is more of a specifc requirement and if you have 4 districts then it will exceed 6100 unless you have 4 districts with less than 1525 population which I do not see here.. So join the map table twice should be enough here?

    Or are you looking for more general solution?

    GulliMeel

    Finding top n Worst Performing queries[/url]
    Improve the performance of Merge Join(special case)
    How to Post Performance Problem -Gail Shaw[/url]

  • Gullimeel (7/3/2012)


    Dwain, Just a small point

    Why do you need CTE because this is more of a specifc requirement and if you have 4 districts then it will exceed 6100 unless you have 4 districts with less than 1525 population which I do not see here.. So join the map table twice should be enough here?

    Or are you looking for more general solution?

    I'm using a CTE for convenience during development. There is a possibility that it might run faster with a properly indexed table, but I can break it up and try that later.

    I think you need to take a shot at the problem yourself to appreciate that it is quite complex. Just downloading all the code Les did will help you to see that.

    The real challenge is going to be to come up with all of the solutions that Les said there were (3725 I think). I'm pretty sure I'll hit the optimal solution on my next attempt, but then to generalize that to produce all possible solutions (and run in real time) is the true challenge.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Dwain,

    My apologies for the mistype on the Population. Fat fingers and all that. :blush:

    I think the only thing the CTE is not resolving correctly is those those Candidates where the Old District only appears once (SW4.doc), which can otherwise seem that there are district combinations that don't appear to map to a neighbor and/or resolve in combination to a population between 5900 and 6100. Other than that, the speed is quite amazing, especially given that in '98, I couldn't find a machine that could process the Triples (SW8.doc), even after a week. The only thing I've found that outperforms a well designed CTE for such is when a multidimensional key (MK) can be formed (or absracted) for partitioning the data. I do think JC's Bin Packing solution offers a path for a MK key, but I haven't explored that just yet. Time...

    G...

    I'm attaching a spreadsheet with the negation results. The first set (SW4) is a single set. I just didn't list it as a single row, but it would be...

    [3,52][6,13][9,26][35,63][48][58,47,2]

    The total number of possible New Sweden's as related earlier, when cross applied is 1x58x162=9396.

    Each new Sweden would have 28 new districts (6 uniques, 12 doubles, 10 triples) out of the old districts. There are 12 old districts in the unique set, 24 in the doubles, and 33 in the triples = 66 (original old districts).

    36118 Unique Population per set

    71930 Double Population per set

    59928 Triple Population per set

    ---------

    167976 Total Sweden Population (old and new)

    ~Les

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

Viewing 15 posts - 76 through 90 (of 109 total)

You must be logged in to reply to this topic. Login to reply