On Wednesday 13 October 2010 01:50:23 Mark Kirkwood wrote:
> On 13/10/10 21:38, Neil Whelchel wrote:
> > So with our conclusion pile so far we can deduce that if we were to keep
> > all of our data in two column tables (one to link them together, and the
> > other to store one column of data), we stand a much better chance of
> > making the entire table to be counted fit in RAM, so we simply apply the
> > WHERE clause to a specific table as opposed to a column within a wider
> > table... This seems to defeat the entire goal of the relational
> > database...
>
> That is a bit excessive I think - a more reasonable conclusion to draw
> is that tables bigger than RAM will drop to I/O max speed to scan, rather
> than DIMM max speed...
>
> There are things you can do to radically improve I/O throughput - e.g. a
> pair of AMC or ARECA 12 slot RAID cards set up RAID 10 and tuned properly
> should give you a max sequential throughput of something like 12*100
> MB/s = 1.2 GB/s. So your example table (estimated at 2GB) should be able
> to be counted by Postgres in about 3-4 seconds...
>
> This assumes a more capable machine than you are testing on, I suspect.
>
> Cheers
>
> Mark

The good ol' brute force approach! I knew I'd see this one sooner or later, though I was not sure if I would see the 16TB of RAM suggestion first.

Seriously though, as the title of this thread suggests, everything is relative. Sure, count(*) and everything else will run faster with more system power. It just seems to me that count(*) is slower than it could be, given a set of conditions. I started this thread because I think there must be a better way to count matches on an INDEXed column than shoving the entire table through RAM (including columns you are not interested in at the moment). Even worse, when you have no (reasonable) control over the WHERE clause, there is nothing to keep your system from thrashing for the next week because somebody put in criteria that matched a few TB of records, and there is no way to LIMIT count(*) other than externally timing the query and aborting it if it takes too long.

What is needed is a way to determine how many rows are likely to match a given WHERE clause, so we can cut off useless queries, but we need a fast count(*) for that, or a limit on the existing one... I seem to remember saying something about an index-driven estimate(*) at one point...

I might go as far as to rattle the cage of the developers to see if it makes any sense to add some column-oriented storage capability to Postgres. The hot ticket would be to specify an attribute on a column so that the back end could shadow (store) that column in a column-oriented table, so aggregate functions could work on it with good efficiency... or is that an INDEX?

Since the thread started, I have had people ask about different system configurations, especially the filesystem (XFS, ext4...). I had never tested ext4, and since we are all involved here, I thought I could do so and share my results with others; that is why I got into timing things in the first place. Timing count(*) in my later postings is really not the point, since count(*) simply drags the entire table off of the RAID and through RAM; I could use any other function, like max()... No, that one can narrow its scan with an INDEX... OK, sum(), there we go!

-Neil-
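
P.S. For anyone following along, the closest thing I know of to an estimate(*) today is to ask the planner for its row estimate. Below is only a sketch of that trick (the function name "estimate_rows" and the example table "log" are made up for illustration); it is only as accurate as the statistics, but it comes back in milliseconds no matter how many rows actually match:

  -- Sketch: wrap EXPLAIN in a PL/pgSQL function and pull the planner's
  -- row estimate out of the first plan line. Accuracy depends entirely
  -- on up-to-date statistics (ANALYZE), not on the real row count.
  CREATE OR REPLACE FUNCTION estimate_rows(query text) RETURNS bigint AS $$
  DECLARE
      plan_line text;
      n         bigint;
  BEGIN
      FOR plan_line IN EXECUTE 'EXPLAIN ' || query LOOP
          -- the top plan node carries the total estimate, e.g. "rows=5000"
          n := substring(plan_line FROM ' rows=([[:digit:]]+)');
          EXIT WHEN n IS NOT NULL;
      END LOOP;
      RETURN n;
  END;
  $$ LANGUAGE plpgsql;

  -- usage: SELECT estimate_rows('SELECT * FROM log WHERE device_id = 5');

That could be the "cut off useless queries" gate: check the estimate first, and only run the real count(*) when the estimate is small enough to be worth it.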
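
P.P.S. On the "no way to LIMIT count(*)" point: statement_timeout at least moves the abort-if-it-runs-too-long logic into the server instead of the application. A sketch, again with a made-up table:

  SET statement_timeout = '5s';   -- cancel anything that runs longer than 5 seconds
  SELECT count(*) FROM log WHERE device_id = 5;
  RESET statement_timeout;

It does not give a partial count, it just cancels the query, but it keeps one bad WHERE clause from thrashing the box for a week.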