Re: count(*) and bad design was: Experiences with extensibility

Ivan Sergio Borgonovo <mail@xxxxxxxxxxxxxxx> · Wed, 9 Jan 2008 15:28:04 +0100

On Wed, 9 Jan 2008 13:04:39 +0100
"Harald Armin Massa" <haraldarminmassa@xxxxxxxxx> wrote:

> Ivan,

> > Please forgive my naiveness in this field but what does it mean an
> > "exact count" and what other DB means with "an exact count" and
> > how other DB deal with it?

> PostgreSQL will give you an exact count of the contents of the
> database as it is in the moment you begin your count. (i.e. the
> transaction starts)

> BUT as the table is not locked, in parallel somebody can bulkload
> MANY items into the database, so at the moment (start of your
> transaction)
> + 1msec your count may be invalid allready.

That's reasonable. What other DB do and what is the SQL standard
requirement for count(*)?

> > I'd expect it perform as good or better than other DB since now
> > the bottleneck should be how efficiently it can filter records...
> > but still a count(*) with a where clause will incur in the same
> > problem of what "exact" means.
> 
> I know of 3 concepts to answer count() faster then PostreSQL:
> 
> 1) just lie, present an estimate
> 2) do not have MVCC
> 3) store "record deleted info" in index, so you can answer count()
> with only scanning the index

Sorry if I insist but I think this is a beaten selling point against
postgresql.
One of the most recurring argument about why someone else db is
better than pg is count is slow.

Who lies?
If it is possible to do otherwise to have count run faster what are
the trade off that make it unreasonable to implement it in pg?

This is not very useful question but 3) imply that select scan the
index return the rows and just later check if they are still there.
Is it?
And since indexes aren't updated "on the fly" you may get back a
larger number than what is actually the real value.

Let me consider an everyday use where count() looks as the most
obvious solution: paging.

I search trough a table and I need to know which is the last page.
Concurrently someone is deleting a ton of records.
No matter if count() is fast or not when I output the pager it will
be "wrong". But still I'll need an estimate of the number of pages,
it is not fun if that estimate is wrong *and* slow.
And once you add the where clauses there is no way to cache the count.
Is there a way to count based on indexes without taking into account
deleted rows so to "count" faster?

I can make the search faster using indexes as Dann Corbit suggested,
but as you imply that won't make count as fast as the
"concurrence"[1] that lie or don't use MVCC or store deleted info in
indexes.

SELECT reltuples FROM pg_class WHERE relname = <table_name>;

doesn't apply since you can't add "conditions".

Please be patient. I hear this over and over and over. Postgresql is
faulty because it can't count fast.
And the only reply I've seen are: it is bad design to use count,
you're a dumb ass. I admit I may be a dumb ass, but it is hard to
sell Postgres if I continue to be a dumb ass ;)

- What count(*) really does?
Now I understood that count(*) return the # of rows as it sees them
at the moment it was invoked. That should be true for other DB as
well. That means that unless other DB lock the table they can't take
into account records that are going to be deleted once the count has
been issued.

- When count can be avoided?
Well since even other DB may report the "wrong" number, this makes
count() look less a Sacre Graal. But still if you need an estimate,
wouldn't it be better to have it quickly?
How does postgresql compare to other DB when you run:
select count(*) from table where conditions
once you use indexes?
If such kind of query will have anyway to scan the results to see if
they are still there since info about deleted records aren't stored
in indexes, is there a way to ignore this and just have a faster
estimate?
I still can't see why it is bad design to use count().

- When count() can't be avoided?
All the situation where you may really need count() I think you also
need to lock the table but well I'd be curious to see an example
where you need count()

Still can somebody make an example of bad design and one where
count() couldn't be avoided if any?

Consider that while it makes few sense to rely on "wrong" numbers in
a "business" environment where data integrity/coherence makes *a lot*
of sense it is not so clear in a CMS world where most of those
critics come from.

I know that the arguments to promote postgres in the "business" world
where DB2, Oracle and MS SQL play (?) may be different and count()
may lose its importance in that context and you could say that other
advantages plenty pay off the "slowness" of an operation that in such
a context is rare.

thanks

[1] or does postgres perform as the concurrence once you add where
clauses?

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org/