Re: Hardware/OS recommendations for large databases

Bruce Momjian wrote:
Greg Stark wrote:
Alan Stange <stange@xxxxxxxxxx> writes:

The point you're making doesn't match my experience with *any* storage or program
I've ever used, including postgresql.  Your point suggests that the storage
system is idle and that postgresql is broken because it isn't able to use the
resources available...even when the cpu is very idle.  How can that make sense?
Well I think what he's saying is that Postgres is issuing a read, then waiting
for the data to return. Then it does some processing, and goes back to issue
another read. The CPU is idle half the time because Postgres isn't capable of
doing any work while waiting for i/o, and the i/o system is idle half the time
while the CPU intensive part happens.

(Consider as a pathological example a program that reads 8k then sleeps for
10ms, and loops doing that 1,000 times. Now consider the same program
optimized to read 8M asynchronously and sleep for 10s. By the time it's
finished sleeping it has probably read in all 8M. Whereas the program that
read 8k in little chunks interleaved with small sleeps would probably take
twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
idle.)
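
In code, the two patterns look roughly like this (a minimal sketch; the test
file name and sizes are made up, and POSIX AIO stands in for "asynchronously"
just to make the overlap concrete):

/*
 * Pattern A: 1,000 iterations of "read 8k, sleep 10ms".  The process
 * alternates between waiting on the disk and sleeping, so both the CPU
 * and the device sit partly idle.
 *
 * Pattern B: start one asynchronous 8M read, then sleep for 10s.  By the
 * time the sleep ends the data has usually arrived, because the I/O
 * overlapped with the "work" (simulated here by the sleep).
 */
#include <aio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (8 * 1024)
#define TOTAL (8 * 1024 * 1024)

static void pattern_a(int fd, char *buf)
{
    for (int i = 0; i < TOTAL / CHUNK; i++) {
        if (read(fd, buf, CHUNK) <= 0)   /* synchronous 8k read */
            break;
        usleep(10 * 1000);               /* 10ms of "processing" */
    }
}

static void pattern_b(int fd, char *buf)
{
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = TOTAL;
    cb.aio_offset = 0;

    aio_read(&cb);               /* start the 8M read ...            */
    sleep(10);                   /* ... and overlap it with the work */
    aio_suspend(list, 1, NULL);  /* wait if it somehow isn't done    */
    aio_return(&cb);
}

int main(void)
{
    char *buf = malloc(TOTAL);
    int   fd  = open("testfile", O_RDONLY);   /* hypothetical test file */

    if (fd < 0 || buf == NULL)
        return 1;

    pattern_a(fd, buf);
    lseek(fd, 0, SEEK_SET);
    pattern_b(fd, buf);

    close(fd);
    free(buf);
    return 0;
}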

It's a reasonable theory and it's not inconsistent with the results you sent.
But it's not exactly proven either. Nor is it clear how to improve matters.
Adding additional threads to handle the i/o adds an enormous amount of
complexity and creates lots of opportunity for other contention that could
easily eat all of the gains.

Perfect summary.  We have a background writer now.  Ideally we would
have a background reader that reads ahead blocks into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
"create a thread" approach that could easily bog us down as outlined
above.

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options?  mmap()?  I have no idea.  It seems a larger page
size does help.
For sequential scans, you do have a background reader. It's the kernel. As long as you don't issue a seek() between read() calls, the kernel will get the hint about sequential IO and begin to perform read-ahead for you. This is where the above analysis isn't quite right: while postgresql is processing the returned data from the read() call, the kernel has also issued reads as part of the read-ahead, keeping the device busy while the cpu is busy. (I'm assuming these details for Linux; Solaris/UFS does work this way.) Issue one seek on the file and the read-ahead algorithm will back off for a while. This was my point about some descriptions of how the system works not being sensible.
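
As a sketch (the file name and block size are placeholders, and posix_fadvise()
is just one way to make the hint explicit instead of relying on the kernel to
detect the pattern):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[32 * 1024];
    int  fd = open("bigtable", O_RDONLY);   /* hypothetical heap file */

    if (fd < 0) {
        perror("open");
        return 1;
    }

#ifdef POSIX_FADV_SEQUENTIAL
    /* Optional: state the sequential-access intent up front. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif

    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* Process buf here.  The kernel keeps prefetching the next
         * blocks while this runs, so the disk and the CPU stay busy
         * at the same time. */
    }

    /* A single lseek() to a random offset inside this loop would break
     * the sequential pattern and the read-ahead window would shrink. */
    close(fd);
    return 0;
}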

If your goal is sequential IO, then you must use larger block sizes. No one would use 8KB IO to achieve high sequential IO rates. Simply put, read() is about the slowest way to get 8KB of data. Switching to 32KB blocks reduces the system call overhead by a large margin. Larger blocks would be better still, up to the stripe size of your mirror. (Of course, you're using a mirror and not raid5 if you care about performance.)
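
A quick way to see the difference is to time the same scan at a few block sizes
(dd with bs=8k versus bs=32k already shows it); the loop below does the same
thing with the size passed on the command line. The file and program names are
placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

/* Usage: ./scan <file> <blocksize>, e.g. "./scan bigtable 8192"
 * and then "./scan bigtable 32768" on the same (uncached) file. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file blocksize\n", argv[0]);
        return 1;
    }

    size_t  bs  = (size_t) atol(argv[2]);
    char   *buf = malloc(bs);
    int     fd  = open(argv[1], O_RDONLY);

    if (fd < 0 || buf == NULL) {
        perror("setup");
        return 1;
    }

    struct timeval t0, t1;
    long long      total = 0;
    ssize_t        n;

    gettimeofday(&t0, NULL);
    while ((n = read(fd, buf, bs)) > 0)
        total += n;
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%lld bytes in %.2fs (%.1f MB/s) with %zu-byte reads\n",
           total, secs, total / secs / (1024 * 1024), bs);

    close(fd);
    free(buf);
    return 0;
}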

I don't think the memcpy of data from the kernel to userspace is that big of an issue right now. dd and all the high-end network interfaces manage OK doing it, so I'd expect postgresql to do all right with it too. Direct IO will avoid that memcpy, but then you also don't get any caching of the files in memory. I'd be more concerned about any memcpy calls or general data management within postgresql. Does postgresql use the platform-specific memcpy() in libc? Some care might be needed to ensure that the memory blocks within postgresql are all properly aligned, so that one isn't ping-ponging cache lines around (usually done by padding the buffer sizes by an extra 32 bytes or an L1 line size). Whatever you do, all the usual high-performance computing tricks should be used before considering any rewrite of major code sections.
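
The padding/alignment trick looks roughly like this; the 64-byte line size,
the 8KB block size, and the buffer count are assumptions for illustration, and
none of it is taken from the actual PostgreSQL buffer manager:

#include <stdlib.h>
#include <string.h>

#define CACHE_LINE  64                /* assumed L1 line size */
#define BLOCK_SIZE  (8 * 1024)
#define NBUFFERS    1024

/* Round each buffer up to a whole number of cache lines and then add one
 * extra line of padding, so adjacent buffers never share a line. */
#define PADDED_SIZE \
    (((BLOCK_SIZE + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE + CACHE_LINE)

int main(void)
{
    void *pool;

    /* Start the whole pool on a cache-line boundary ... */
    if (posix_memalign(&pool, CACHE_LINE, (size_t) NBUFFERS * PADDED_SIZE) != 0)
        return 1;

    /* ... so every buffer's start address is line-aligned as well. */
    char *buf0 = (char *) pool;
    char *buf1 = (char *) pool + PADDED_SIZE;

    memset(buf0, 0, BLOCK_SIZE);
    memset(buf1, 0, BLOCK_SIZE);

    free(pool);
    return 0;
}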

Personally, I'd like to see some detailed profiling done using hardware counters for cpu cycles, cache misses, etc. Given the poor quality of the work discussed in this thread, I don't have much confidence in any additional results at this time. None of the analysis would be acceptable in any environment in which I've worked. Be sure to take a look at Sun's free Workshop tools, as they are excellent for this sort of profiling and one doesn't need to recompile to use them. If I get a little time in the next week or two, I might take a crack at this.

Cheers,

-- Alan


