Re: Hardware/OS recommendations for large databases (

Luke Lonergan wrote:
> OK - slower this time:
> We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
> machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
> which all are capped at 120MB/s when doing sequential scans with different
> versions of Postgres.
Postgresql issues the exact same sequence of read() calls as does dd. So why is dd so much faster?
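One way to check that claim, for what it's worth: attach strace to the backend while the scan runs and compare its trace with dd's. A sketch; the backend pid 12345 is made up:

$ strace -p 12345 -e trace=read,lseek 2>&1 | head -20    # backend during the scan
$ strace -e trace=read dd if=/fidb1/bigfile of=/dev/null bs=8k 2>&1 | head -20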

I'd be careful with the dd read of a 16GB file on an 8GB system. Make sure you umount the file system first, so that all of the file is flushed from memory. Some systems use freebehind on sequential reads to avoid flushing useful data from memory, but without it you'd find that half of your 16GB file is still cached.

The same point holds for the writes: when dd finishes, not all the data is on disk. You need to issue a sync() call to make that happen, and the sync needs to be part of what you time; use lmdd to ensure that the data is actually all written. In other words, I think your dd results are possibly misleading.
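A minimal sketch of the more careful measurement, assuming Linux, that /fidb1 is listed in fstab, and the file from the example below:

$ umount /fidb1 && mount /fidb1                # drop the cached pages first
$ time dd if=/fidb1/bigfile of=/dev/null bs=8k

# For writes, make the sync part of what's timed:
$ time sh -c 'dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000 && sync'
# or use lmbench's lmdd, which can issue the sync itself:
$ lmdd if=internal of=/fidb1/bigfile bs=8k count=800000 sync=1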

It's trivial to demonstrate:

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real    0m13.780s
user    0m0.134s
sys     0m13.510s

Oops. I just wrote 470MB/s to a file system that has a peak write speed of 200MB/s.
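(The arithmetic: 800,000 records x 8KB is about 6.5GB, and 6.5GB in 13.78 seconds is roughly 470MB/s.)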

Now, you might say that you wrote a 16GB file on an 8GB machine, so this isn't an issue. Even so, it makes your dd numbers look fast, as some of the data will still be unwritten when dd exits.


I'd also suggest running dd on the same files postgresql reads. I suspect you'd find that the layout of the postgresql files isn't that good, as they are grown bit by bit, unlike a file created by dd'ing a large file in one pass.
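For example (the data directory and relfilenode here are made up; point dd at whichever files back the table you scanned, including the .1, .2, ... 1GB segments):

$ umount /fidb1 && mount /fidb1
$ time dd if=/fidb1/pgdata/base/16384/16385 of=/dev/null bs=32k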

> Understand my point: It doesn't matter that there is idle or iowait on the
> CPU, the postgres executor is not able to drive the I/O rate for two
> reasons: there is a lot of CPU used for the scan (the 40% you reported) and
> a lack of asynchrony (the iowait time).  That means that by speeding up the
> CPU you only reduce the first part, but you don't fix the second and v.v.
>
> With more aggressive readahead, the second problem (the I/O asynchrony) is
> handled better by the Linux kernel and filesystem.  That's what we're seeing
> with XFS.
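For reference, on Linux the readahead being described can be inspected and raised per device with blockdev; a sketch, assuming the file system sits on /dev/sda:

$ blockdev --getra /dev/sda        # current readahead, in 512-byte sectors
$ blockdev --setra 4096 /dev/sda   # raise it to 2MB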

I think your point doesn't hold up. Every time you make it, I come away posting another result showing it to be incorrect.

The point you're making doesn't match my experience with *any* storage or program I've ever used, including postgresql. Your point suggests that the storage system is idle and that postgresql is broken because it isn't able to use the resources available, even when the cpu is very idle. How can that make sense? The issue here is that the storage system is very active doing reads on the files, which might be somewhat poorly allocated on disk because postgresql grows the tables bit by bit.

I had the same readahead in Reiser and in XFS. The XFS performance was better because XFS does a better job of large file allocation on disk, thus resulting in many fewer seeks (generated by the file system itself) to read the files back in. As an example, some file systems like UFS purposely scatter large files across cylinder groups to avoid forcing large seeks on small files; one can tune this behavior so that large files are more tightly allocated.
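As a sketch of that kind of tuning on a Solaris-style UFS (the device name here is made up): maxbpg caps how many blocks a single file may allocate in one cylinder group before being forced to the next, so raising it keeps large files more contiguous:

$ tunefs -e 2048 /dev/rdsk/c0t0d0s6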



Of course, because this is engineering, I have another obligatory data point: This time it's a 4.2GB table using 137,138 32KB pages with nearly 41 million rows.
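(For reference: 137,138 pages x 32KB/page is about 4.49 billion bytes, which is the 4.2GB figure when counted in binary gigabytes.)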

A "select count(1)" on the table completes in 14.6 seconds, for an average read rate of 320 MB/s. One cpu was idle, the other averaged 32% system time and 68 user time for the 14 second period. This is on a 2.2Ghz Opteron. A faster cpu would show increased performance as I really am cpu bound finally. Postgresql is clearly able to issue the relevant sequential read() system calls and sink the resulting data without a problem if the file system is capable of providing the data. It can do this up to a speed of ~300MB/s on this class of system. Now it should be fairly simple to tweak the few spots where some excess memory copies are being done and up this result substantially. I hope postgresql is always using the libc memcpy as that's going to be a lot faster then some private routine.

-- Alan



