On Wednesday 13 October 2010 01:50:23 Mark Kirkwood wrote:
> On 13/10/10 21:38, Neil Whelchel wrote:
> > So with our conclusion pile so far we can deduce that if we were to keep
> > all of our data in two column tables (one to link them together, and the
> > other to store one column of data), we stand a much better chance of
> > making the entire table to be counted fit in RAM, so we simply apply the
> > WHERE clause to a specific table as opposed to a column within a wider
> > table... This seems to defeat the entire goal of the relational
> > database...
>
> That is a bit excessive I think - a more reasonable conclusion to draw
> is that tables bigger than RAM will drop to I/O max speed to scan, rather
> than DIMM max speed...
>
> There are things you can do to radically improve I/O throughput - e.g. a
> pair of AMC or ARECA 12 slot RAID cards set up RAID 10 and tuned properly
> should give you a max sequential throughput of something like 12*100
> MB/s = 1.2 GB/s. So your example table (estimated at 2GB) should be able
> to be counted by Postgres in about 3-4 seconds...
>
> This assumes a more capable machine than you are testing on, I suspect.
>
> Cheers
>
> Mark

The good ol' brute force approach! I knew I'd see this one sooner or later, though I was not sure if I would see the 16TB of RAM suggestion first.

Seriously though, as the title of this thread suggests, everything is relative. Sure, count(*) and everything else will run faster with more system power. It just seems to me that count(*) is slower than it could be, given a set of conditions. I started this thread because I think there must be a better way to count matches on an INDEXed column than shoving the entire table through RAM (including columns you are not interested in at the moment). Even worse, when you have no (reasonable) control over the WHERE clause, there is nothing to keep your system from thrashing for the next week because somebody put in criteria that matched a few TB of records, and there is no way to LIMIT count(*) other than externally timing the query and aborting it if it takes too long.

What is needed is a way to determine how many rows are likely to match a given WHERE clause, so we can cut off useless queries, but we need a fast count(*) for that, or a limit on the existing one... I seem to remember saying something about an index-driven estimate(*) at one point...

I might go as far as to rattle the cage of the developers to see if it makes any sense to add some column-oriented storage capability to Postgres. The hot ticket would be to specify an attribute on a column so that the back end could shadow (store) that column in a column-oriented table, so aggregate functions could work on it with good efficiency... or is that an INDEX?

Since the thread started, I have had people ask about different system configurations, especially the filesystem (XFS, ext4...). I had never tested ext4, and since we are all involved here, I thought I could do so and share my results with others; that is why I got into timing things in the first place. Timing count(*) in my later postings is really not the point, since count(*) simply drags the entire table off of the RAID and through RAM; I could use any other function, like max()... No, that one can narrow its scan with an INDEX... OK, sum(), there we go!

-Neil-
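
P.S. For anyone following along, the closest thing I know of to an estimate(*) today is to ask the planner for its row estimate. Below is only a sketch of that trick (the function name "estimate_rows" and the example table "log" are made up for illustration); it is only as accurate as the statistics, but it comes back in milliseconds no matter how many rows actually match:

  -- Sketch: wrap EXPLAIN in a PL/pgSQL function and pull the planner's
  -- row estimate out of the first plan line. Accuracy depends entirely
  -- on up-to-date statistics (ANALYZE), not on the real row count.
  CREATE OR REPLACE FUNCTION estimate_rows(query text) RETURNS bigint AS $$
  DECLARE
      plan_line text;
      n         bigint;
  BEGIN
      FOR plan_line IN EXECUTE 'EXPLAIN ' || query LOOP
          -- the top plan node carries the total estimate, e.g. "rows=5000"
          n := substring(plan_line FROM ' rows=([[:digit:]]+)');
          EXIT WHEN n IS NOT NULL;
      END LOOP;
      RETURN n;
  END;
  $$ LANGUAGE plpgsql;

  -- usage: SELECT estimate_rows('SELECT * FROM log WHERE device_id = 5');

That could be the "cut off useless queries" gate: check the estimate first, and only run the real count(*) when the estimate is small enough to be worth it.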
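
P.P.S. On the "no way to LIMIT count(*)" point: statement_timeout at least moves the abort-if-it-runs-too-long logic into the server instead of the application. A sketch, again with a made-up table:

  SET statement_timeout = '5s';   -- cancel anything that runs longer than 5 seconds
  SELECT count(*) FROM log WHERE device_id = 5;
  RESET statement_timeout;

It does not give a partial count, it just cancels the query, but it keeps one bad WHERE clause from thrashing the box for a week.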