Re: optimizing query with multiple aggregates

"Marc Mamin" <M.Mamin@xxxxxxxxxxxx> · Mon, 26 Oct 2009 10:39:48 +0100

Hello,

I didn't try it, but following should be slightly 
faster:

COUNT( CASE WHEN field >= x AND field < y THEN true 
END)
intead of 
SUM( CASE WHEN field >= x AND field < y THEN 1 ELSE 0 
END)

HTH,

Marc 
Mamin

From: pgsql-performance-owner@xxxxxxxxxxxxxx 
[mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of Nikolas 
Everett
Sent: Thursday, October 22, 2009 4:48 AM
To: Doug 
Cole
Cc: pgsql-performance
Subject: Re:  optimizing 
query with multiple aggregates

So you've got a query like:

SELECT SUM(CASE 
WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) as 
zeroToTen,

   SUM(CASE WHEN 
field >= 10 AND field < 20 THEN 1 ELSE 0 END) as 
tenToTwenty,

     SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0 
END) as tenToTwenty,
...
FROM 
 bigtable

My guess is this 
forcing a whole bunch of if checks and your getting cpu bound.  Could you 
try something like:

SELECT SUM(CASE 
WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) as 
zeroToTen,

   SUM(CASE WHEN 
field >= 10 AND field < 20 THEN count ELSE 0 END) as 
tenToTwenty,

     SUM(CASE WHEN field >= 20 AND field < 30 THEN count 
ELSE 0 END) as tenToTwenty,
...
FROM  (SELECT 
field, count(*) FROM bigtable GROUP BY field)

which will allow a hash 
aggregate?  You'd do a hash aggregate on the whole table which should be 
quick and then you'd summarize your bins.

This all supposes that 
you don't want to just query postgres's column statistics.

On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole <dougcole@xxxxxxxxx> wrote:

  On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure <mmoncure@xxxxxxxxx> 
  wrote:
>
> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@xxxxxxxxx> wrote:
> 
  > I have a reporting query that is taking nearly all of it's time in 
  aggregate
> > functions and I'm trying to figure out how to optimize 
  it.  The query takes
> > approximately 170ms when run with 
  "select *", but when run with all the
> > aggregate functions the 
  query takes 18 seconds.  The slowness comes from our
> > attempt 
  to find distribution data using selects of the form:
> >
> > 
  SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
> 
  >
> > repeated across many different x,y values and fields to 
  build out several
> > histograms of the data.  The main culprit 
  appears to be the CASE statement,
> > but I'm not sure what to use 
  instead.  I'm sure other people have had
> > similar queries and 
  I was wondering what methods they used to build out data
> > like 
  this?
>
> have you tried:
>
> count(*) where field 
  >= x AND field < y;
>
> ??
>
> 
  merlin

Unless I'm misunderstanding you, that would require 
  breaking each bin
into a separate sql statement and since I'm trying to 
  calculate more
than 100 bins between the different fields any improvement 
  in the
aggregate functions would be overwhelmed by the cost of the 
  actual
query, which is about 170ms.
Thanks,
Doug

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To 
  make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance