top n percent WITH group by?

Question

Post reply

top n percent WITH group by?

sqlchick

Old Hand

Points: 329
More actions
March 31, 2005 at 2:27 pm

#68813

Is it possible to use Top N Percent with Group By? I have a table with approximately 50 types of datasets and I need to find the top 90% of EACH of those types (not top 90% of the combined total). Can SQL Server do this, or must I (gulp) run the query and store the results 50 times?

Thanks,

Heather

Viewing 15 posts - 1 through 15 (of 32 total)

You must be logged in to reply to this topic. Login to reply

Farrell Keough SSCoach Points: 17414 More actions · Answer 1

A simple bulldozer approach might be to select each type into a #TempTable and then UNION ALL into a final #TempTable for output.

Otherwise, it sounds to me like you are going to need to isolate each type of dataset (SELECT DISTINCT) into a #TempTable and then loop through each of those types to get your TOP 90%. This may actually be a better option since if a new type is added, then this should handle that situation....

I wasn't born stupid - I had to study.

sqlchick Old Hand Points: 329 More actions · Answer 2

sqlchick

Old Hand

Points: 329

March 31, 2005 at 3:34 pm

#549934

Thanks, Farrell. I had a feeling that would be the answer!

Jim P. SSCrazy Eights Points: 8725 More actions · Answer 3

Assuming that your key field is not the top 10 percent field, I think it would be something like below. I maybe overly complicating it...but this is how I would do it. There may be a more efficient way. I'm sure someone will let us know.

SELECT groupfield1, groupfield2, .....
FROM tablename
WHERE keyfield IN (SELECT keyfield
                 FROM tablename
                 WHERE numfield 
               IN (SELECT TOP 10 PERCENT numfield
                  FROM tablename))
GROUP BY groupfield1, groupfield2, .....

Again there may be a better way. I hope there is. And if your keyfield is what your trying to get then you should be able to get rid of the second subquery. Let us know.

----------------
Jim P.

A little bit of this and a little byte of that can cause bloatware.

sqlchick Old Hand Points: 329 More actions · Answer 4

Hmm, interesting. To make it more clear, what I have is an enormous table of records (over 8 million to be exact) that contain records on every state in the US. So for instance, I may have a 30,000 records for Virginia and 50,000 for Texas.

I am doing a sort by a particular numerical value - let's say population, just as an example. So I need to find the top 90% of records for every state by population density.

The problem is that when I try to get all the records using group by State, I get the top 90% of the total - not the top 90% within each state. Unfortunately, it really doesn't look like I can tell SQL to perform the 90% calculation with the group by results, so I suppose I'll have to loop through them one by one.

hoo-t SSChampion Points: 12713 More actions · Answer 5

hoo-t

SSChampion

Points: 12713

March 31, 2005 at 3:59 pm

#549937

Don't give up. I'm sure there is a way to do this with a set based query, I just can't get my brain wrapped around it just now (too close to going home time ). I would imagine that someone will post a set based query, but if not, I'll look at it more tomorrow.

Steve

sqlchick Old Hand Points: 329 More actions · Answer 6

sqlchick

Old Hand

Points: 329

March 31, 2005 at 4:08 pm

#549938

Thanks, guys.

The good news is, I only have to do this once.

Aaron Templeton SSCarpal Tunnel Points: 4655 More actions · Answer 7

You can't use GROUP BY unless all non grouped columns are agregate results. If you want the 90% rows actually individually reported you have to drop GROUP BY and go with just an ordered selection. Is this what you are after?

SELECT A.State, A.Region, A.PopulationDensity

FROM MyTable A

WHERE A.Region IN ( SELECT TOP 90 PERCENT B.Region

FROM MyTable B

WHERE B.State = A.State

ORDER BY B.PopulationDensity )

ORDER BY A.State, A.PopulationDensity

sqlchick Old Hand Points: 329 More actions · Answer 8

I know I can use a subquery to get the top 90% of records from a single state and then perform my state summary calculations from there. I was just hoping that I could get the top 90% of each state's records in a single dataset using a group by clause.

What I may end up doing at this point is just outputting the results I need one by one into a new table (unfortunately, with milions of records, a temp table would be tough to use) and then query those new tables respectively.

-heather

iadams SSCommitted Points: 1920 More actions · Answer 9

iadams

SSCommitted

Points: 1920

April 1, 2005 at 9:52 am

#550137

What about creating views (as your groups) and querying (UNION) the views?

-Isaiah

sqlchick Old Hand Points: 329 More actions · Answer 10

That's also a possibility. Because I have 8 million records and I need to grab different slices of percentiles per state, I'm just going to create a few additional tables to hold the state information I need.

For example:

--- must run for each state

SELECT TOP 10 PERCENT *

INTO tblStateCensus_top10percent

FROM tblStateCensus

WHERE state = 'AL'

ORDER BY population DESC

SELECT TOP 5 PERCENT *

INTO tblStateCensus_top5percent

FROM tblStateCensus

WHERE state = 'AL'

ORDER BY population DESC

--- run once after all states are run in above queries

SELECT *

INTO tblStateCensus_bottom90percent

FROM tblStateCensus

WHERE id not in (select id from tblStateCensus_top10percent)

SELECT *

INTO tblStateCensus_bottom95percent

FROM tblStateCensus

WHERE id not in (select id from tblStateCensus_top5percent)

Using the 'not in' for the additional percentages to avoid overlap of records.

I could write some code to loop through this, or I could just run the top half of that 50 or so times. Probably the latter will be faster.

As I said, this is one-time data analysis for journalism purposes so the queries don't need to be optimized for performance.

Thank you all for your help!

Heather

tschneid Valued Member Points: 63 More actions · Answer 11

I've had to do something like this before. Using the derived table concept should simplify your life.

select top 90 percent

from

(select fld1,fld2 from MyTestTable group by fld1,fld2)

In this example, do your grouping sql statement, make that the derived table, then select 90 percent of those results.

Good Luck

--Todd

sqlchick Old Hand Points: 329 More actions · Answer 12

Unfortunately I can't do that in this case, because I need the top percent to be done within the groupings - so I need the top 90% of records for alabama, and of arkansas, and so on. That gives me the top 90% of the total record number.

It's okay, guys. I'm just outputting the records into a couple of different tables. I'll be finished by the time I figure out a better way to do it. Since it's for a journalism purpose, I don't have much time.

Thanks for all of your help!

Heather

Jim P. SSCrazy Eights Points: 8725 More actions · Answer 13

Instead of tables I would do views. Less data space and once you create the first one in the QA such as:

CREATE VIEW Top_90_AK

AS

SELECT TOP 90 PERCENT *

FROM TableName

WHERE STATE = 'AK'

Then all you have to do is a search and replace for AK with AL and so on.

----------------
Jim P.

A little bit of this and a little byte of that can cause bloatware.

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 14

Here's somthing to play with (I wouldn't recommend it on a 8 M lines table though).

SELECT O.XType

-- , count(*) AS TotalHits_Found

, O.name

, (SELECT CEILING(COUNT(*) * 0.9) FROM dbo.SysObjects O4 WHERE O4.XType = O.XType) as [90%]

, (SELECT COUNT(*) FROM dbo.SysObjects O5 WHERE O5.XType = O.XType) AS [100%]

FROM dbo.SysObjects O

WHERE EXISTS (SELECT * FROM dbo.SysObjects O2 WHERE O2.XType = O.XType AND O2.id = O.id and O2.id IN (SELECT TOP 90 PERCENT id FROM dbo.SysObjects O3 WHERE O3.XType = O.XType ORDER BY O3.id))

--GROUP BY O.XType

ORDER BY O.XType

, O.Name