Re: Slow count(*) again...

david@xxxxxxx · Tue, 12 Oct 2010 08:48:34 -0700 (PDT)

On Tue, 12 Oct 2010, Joe Uhl wrote:

The biggest single problem with "select count(*)" is that it is
seriously overused. People use that idiom to establish existence, which
usually leads to a performance disaster in the application using it,
unless the table has no more than few hundred records. SQL language, of
which PostgreSQL offers an excellent implementation,  offers [NOT]
EXISTS clause since its inception in the Jurassic era. The problem is
with the sequential scan, not with counting. I'd even go as far as to
suggest that 99% instances of the "select count(*)" idiom are probably
bad use of the SQL language.

I agree, I have seen many very bad examples of using count(*). I will go so
far as to question the use of count(*) in my examples here. It there a 
better
way to come up with a page list than using count(*)? What is the best 
method
to make a page of results and a list of links to other pages of results? Am 
I
barking up the wrong tree here?
One way I have dealt with this on very large tables is to cache the count(*) 
at the application level (using memcached, terracotta, or something along 
those lines) and then increment that cache whenever you add a row to the 
relevant table.  On application restart that cache is re-initialized with a 
regular old count(*).  This approach works really well and all large systems 
in my experience need caching in front of the DB eventually.  If you have a 
simpler system with say a single application/web server you can simply store 
the value in a variable, the specifics would depend on the language and 
framework you are using.

this works if you know ahead of time what the criteria of the search is 
going to be.

so it will work for

select count(*) from table;

what this won't work for is cases wher the criteria of the search is 
unpredictable, i.e.

ask the user for input

select count(*) from table where field=$input;

David Lang

Another more all-DB approach is to create a statistics tables into which you 
place aggregated statistics rows (num deleted, num inserted, totals, etc) at 
an appropriate time interval in your code.  So you have rows containing 
aggregated statistics information for the past and some tiny portion of the 
new data happening right now that hasn't yet been aggregated.  Queries then 
look like a summation of the aggregated values in the statistics table plus a 
count(*) over just the newest portion of the data table and are generally 
very fast.

Overall I have found that once things get big the layers of your app stack 
start to blend together and have to be combined in clever ways to keep speed 
up.  Postgres is a beast but when you run into things it can't do well just 
find a way to cache it or make it work together with some other persistence 
tech to handle those cases.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance