Re: Hardware/OS recommendations for large databases (

Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 11:39 AM, "Alan Stange" <stange@xxxxxxxxxx> wrote:
>
>> Yes and no.  The one cpu is clearly idle.  The second cpu is 40% busy
>> and 60% idle (aka iowait in the above numbers).
>
> The "aka iowait" is the problem here - iowait is not idle (otherwise it
> would be in the "idle" column).
>
> Iowait is time spent waiting on blocking io calls.  As another poster
> pointed out, you have a two CPU system, and during your scan, as predicted,
> one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
> be context switched to other users, but as I pointed out earlier, that's not
> useful for getting response on decision support queries.

iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OSes as well.

Here's the man page:

    %iowait
        Show the percentage of time that the CPU or CPUs were idle during
        which the system had an outstanding disk I/O request.

If the system had some other cpu-bound work to perform, you wouldn't ever see any iowait time. Anyone claiming the cpu was 100% busy on the sequential scan, based on the one set of numbers I posted, is misunderstanding the actual metrics.
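To see this for yourself, here's a rough sketch (python; the file path is made up, and you'd watch %iowait with iostat or vmstat in another window). Run it plain and iowait climbs while the reader blocks; run it with --with-cpu-load and the same outstanding disk IO stops being counted as iowait, because the cpu is never idle:

    import os
    import sys
    from multiprocessing import Process

    BLOCK = 8192              # 8KB reads, same as postgresql's default page
    PATH = "/tmp/bigfile"     # hypothetical test file, ideally larger than RAM

    def reader():
        # Sequentially read the file; each read() can block, and that
        # blocked time is charged to %iowait only while no other task
        # wants the cpu.
        fd = os.open(PATH, os.O_RDONLY)
        while os.read(fd, BLOCK):
            pass
        os.close(fd)

    def spinner():
        # Pure cpu-bound loop; while this runs, the cpu is never idle,
        # so the pending disk IO no longer shows up as iowait.
        x = 0
        while True:
            x = (x + 1) % 1000003

    if __name__ == "__main__":
        if "--with-cpu-load" in sys.argv:
            Process(target=spinner, daemon=True).start()
        reader()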

> Thanks for your data, it exemplifies many of the points brought up:
> - Lots of disks and expensive I/O hardware does not help improve performance
> on large table queries because I/O bandwidth does not scale beyond
> 110-120MB/s on the fastest CPUs

I don't think that's the conclusion to draw from the anecdotal numbers I posted. This file subsystem doesn't perform as well as expected for any tool: Bonnie, dd, star, etc. don't get a better data rate either. In fact, the storage system wasn't built for performance; it was built to reliably hold a big chunk of data. Even so, postgresql is reading at 130MB/s on it, using about 30% of a single cpu, almost all of which is system time. I would get the same 130MB/s on a system with substantially slower cpus; the limitation isn't the cpus, and it isn't postgresql. It's the IO system that is poorly configured for this test, not postgresql's ability to use it.
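That 130MB/s at ~30% of one cpu came from watching iostat and top, but the same measurement is easy to script. A quick sketch (python; the path is made up, and the file should be well larger than RAM so the page cache doesn't flatter the result):

    import os
    import time

    PATH = "/data/bigfile"    # hypothetical: a file much larger than RAM
    BLOCK = 8192

    fd = os.open(PATH, os.O_RDONLY)
    total = 0
    wall0 = time.time()
    t = os.times()            # (user, sys, child-user, child-sys, elapsed)
    u0, s0 = t[0], t[1]

    while True:
        buf = os.read(fd, BLOCK)
        if not buf:
            break
        total += len(buf)

    wall = time.time() - wall0
    t = os.times()
    u, s = t[0] - u0, t[1] - s0
    os.close(fd)

    print("%.1f MB/s" % (total / wall / 1e6))
    print("user %.1fs + sys %.1fs over %.1fs wall = %.0f%% of one cpu"
          % (u, s, wall, 100.0 * (u + s) / wall))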

In fact, given the numbers I posted, it's clear this system could handily generate more than 120MB/s using a single cpu given a better IO subsystem; it has cpu time to spare.

A simple test can be done: build the database in /dev/shm and time the scans. It's the same read() system call being used, and now one has made the IO system "infinitely fast". The claim being made is that standard postgresql is unable to generate more than 120MB/s of IO on any IO system, due to inefficient use of the kernel API, excessive memory copies, etc. A database on a ram-based file system is just an extreme example of "expensive IO hardware", and all else would be the same. Hmmm, now that I think about it, I could throw a medium-sized table onto /dev/shm using tablespaces on one of our 8GB linux boxes. So why is this experiment not valid, or what is it about the above assertion that I am missing?
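Here's roughly what that experiment would look like (python with psycopg2, which is just my assumption -- any client works; the tablespace path, table, and row count are made up, you need superuser to create the tablespace, and the /dev/shm directory has to exist and be owned by the postgres user):

    import time
    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical DSN
    conn.autocommit = True  # CREATE TABLESPACE can't run inside a transaction
    cur = conn.cursor()

    # roughly 1.5GB of data, small enough for an 8GB box's /dev/shm
    cur.execute("CREATE TABLESPACE shm LOCATION '/dev/shm/pgtest'")
    cur.execute("CREATE TABLE t (i int, pad text) TABLESPACE shm")
    cur.execute("INSERT INTO t SELECT i, repeat('x', 100)"
                " FROM generate_series(1, 10000000) AS i")

    t0 = time.time()
    cur.execute("SELECT count(*) FROM t")   # forces a full sequential scan
    cur.fetchone()
    print("seq scan in %.1fs" % (time.time() - t0))

If the scan rate doesn't move much past 120MB/s even here, the claim holds; if it sails past it, the bottleneck was the IO hardware all along.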


Anyway, if one cares about high-speed sequential IO, then one should start with a much larger block size. Using 8KB IOs is inappropriate for such a configuration. We happen to be using 32KB blocks on our largest database, and it's been the best move for us.
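The block size effect is easy to see even outside the database. A sketch (python again; the path is made up, and you need to defeat the page cache between runs -- simplest is a file much larger than RAM):

    import os
    import time

    PATH = "/data/bigfile"    # hypothetical test file

    for block in (8192, 32768, 131072):
        fd = os.open(PATH, os.O_RDONLY)
        total = 0
        t0 = time.time()
        while True:
            buf = os.read(fd, block)
            if not buf:
                break
            total += len(buf)
        os.close(fd)
        dt = time.time() - t0
        print("%6d-byte reads: %6.1f MB/s" % (block, total / dt / 1e6))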

-- Alan

