Re: Slow performance when querying millions of rows


On 06/28/2011 07:50 PM, Craig McIlwee wrote:
> I was thinking that shared buffers controlled the amount of data, primarily table and index pages, that the database could store in memory at once. Based on that assumption, I thought that a larger value would enable an entire table + index to be in memory together and speed up the query. Am I wrong?

It does to some extent.  But:

a) This amount doesn't impact query planning much if you've set a large effective_cache_size, because the planner uses effective_cache_size, not shared_buffers, to estimate how much data is already cached

b) The operating system is going to cache things outside of PostgreSQL, too

c) Data read via a sequential scan sometimes skips going into shared_buffers, to keep that cache from being swamped with any single scan

d) Until the data has actually made its way into memory, you may be pulling it in via an inefficient random process at first. By the time the cache is populated, the query you wanted a populated cache to accelerate may already have finished.
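A rough sketch of how these two settings relate (the specific values below are illustrative assumptions, not recommendations for this system):

```sql
-- Inspect the settings discussed above
SHOW shared_buffers;
SHOW effective_cache_size;

-- effective_cache_size is only a planner hint, not an allocation: it should
-- roughly reflect shared_buffers plus the OS page cache available to the
-- server.  For example, on a hypothetical 16 GB machine dedicated to
-- PostgreSQL, a common starting point might look like:
--   shared_buffers       = 4GB
--   effective_cache_size = 12GB
```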

It's possible to get insight into all of this using pg_buffercache to see exactly what's in the cache, and I've put up some talks and scripts to help with that at http://projects.2ndquadrant.com/talks that you might find useful.
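As a minimal sketch of the kind of inspection pg_buffercache allows (assuming the contrib module is installed; the query below is a common summary pattern, not one of the scripts from that page):

```sql
-- pg_buffercache ships as a contrib module; on recent releases it can be
-- installed with CREATE EXTENSION, on older ones via its contrib SQL script.

-- Summarize shared_buffers usage by relation: how many 8 kB buffers each
-- table or index in the current database occupies in the cache right now.
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```

Comparing this output before and after running the slow query shows whether the table and index actually stay resident, which is the question the shared_buffers sizing assumption hinges on.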

> Looks like they are set to adaptive read-ahead now. If the database is executing many concurrent queries, is it reasonable to suspect that the IO requests will compete with each other in such a way that the controller would rarely see many sequential requests since it is serving many processes? The controller does have an 'on' option also that forces read-ahead, maybe that would solve the issue if we can rely on the data to survive in the cache until the actual read request takes place.

I've never been able to find good documentation on what the difference between the adaptive and on modes of that controller really is, which is why I suggested you try both and see. Linux has a uniquely good read-ahead model that was tuned with PostgreSQL specifically in mind, and even there you still have to tweak it upwards from the defaults for the database to fetch data as fast as the drives are capable. So your idea that you will meet or exceed the drives' capabilities for bulk sequential scans is less likely than you might think. RAID5 in theory should give you 2X or more the speed of any single disk when reading a clustered table, but the way PostgreSQL issues those reads may make that hard to realize on Windows.
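For reference, the Linux-side tuning alluded to above is done per block device; a sketch, assuming a data volume at /dev/sda (a placeholder, substitute your actual device):

```shell
# Show the current read-ahead setting, in 512-byte sectors
blockdev --getra /dev/sda

# Raise read-ahead to 4096 sectors (2 MB), a common starting point for
# sequential-scan-heavy workloads; benchmark before and after rather than
# treating any particular value as correct.
blockdev --setra 4096 /dev/sda
```

This doesn't apply to the Windows controller in question, but it illustrates why "the OS defaults already saturate the drives" is not a safe assumption even on the best-tuned platform.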

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
Comprehensive and Customized PostgreSQL Training Classes:
http://www.2ndquadrant.us/postgresql-training/


--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


