On Aug 22, 2007, at 10:57 AM, Kenneth Marshall wrote:
On Wed, Aug 22, 2007 at 07:33:35PM +0400, Dmitry Potapov wrote:
Hello!
We run a large (~66Gb) web-backend database on Postgresql
8.2.4 on
Linux. The hardware is Dual Xeon 5130 with 16Gb ram, LSI Megaraid
U320-2x
scsi controller w/512Mb writeback cache and a BBU. Storage setup
contains 3
raid10 arrays (data, xlog, indexes, each on different array), 12
HDDs total.
Frontend application uses jdbc driver, connection pooling and
threads.
We've run into an issue of IO storms on checkpoints. Once in
20min
(which is checkpoint_interval) the database becomes unresponsive
for about
4-8 seconds. Query processing is suspended, server does nothing
but writing
a large amount of data to disks. Because of the db server being
stalled,
some of the web clients get timeout and disconnect, which is
unacceptable.
Even worse, as the new requests come at a pretty constant rate, by
the time
this storm comes to an end there is a huge amount of sleeping app.
threads
waiting for their queries to complete. After the db server comes
back to
life again, these threads wake up and flood it with queries, so
performance
suffer even more, for some minutes after the checkpoint.
It seemed strange to me that our 70%-read db generates so much
dirty
pages that writing them out takes 4-8 seconds and grabs the full
bandwidth.
First, I started to tune bgwriter to a more aggressive settings,
but this
was of no help, nearly no performance changes at all. Digging into
the issue
further, I discovered that linux page cache was the reason. "Dirty"
parameter in /proc/meminfo (which shows the amount of ready-to-
write "dirty"
data currently sitting in page cache) grows between checkpoints
from 0 to
about 100Mb. When checkpoint comes, all the 100mb got flushed out
to disk,
effectively causing a IO storm.
I found this (http://www.westnet.com/~gsmith/content/linux-
pdflush.htm
<http://www.westnet.com/%7Egsmith/content/linux-pdflush.htm>)
document and
peeked into mm/page-writeback.c in linux kernel source tree. I'm
not sure
that I understand pdflush writeout semantics correctly, but looks
like when
the amount of "dirty" data is less than dirty_background_ratio*RAM/
100,
pdflush only writes pages in background, waking up every
dirty_writeback_centisecs and writing no more than 1024 pages
(MAX_WRITEBACK_PAGES constant). When we hit
dirty_background_ratio, pdflush
starts to write out more agressively.
So, looks like the following scenario takes place: postgresql
constantly
writes something to database and xlog files, dirty data gets to
the page
cache, and then slowly written out by pdflush. When postgres
generates more
dirty pages than pdflush writes out, the amount of dirty data in the
pagecache is growing. When we're at checkpoint, postgres does fsync
() on the
database files, and sleeps until the whole page cache is written out.
By default, dirty_background_ratio is 2%, which is about 328Mb
of 16Gb
total. Following the curring pdflush logic, nearly this amount of
data we
face to write out on checkpoint effective stalling everything
else, so even
1% of 16Gb is too much. My setup experience 4-8 sec pause in
operation even
on ~100Mb dirty pagecache...
I temporaly solved this problem by setting
dirty_background_ratio to
0%. This causes the dirty data to be written out immediately. It
is ok for
our setup (mostly because of large controller cache), but it
doesn't looks
to me as an elegant solution. Is there some other way to fix this
issue
without disabling pagecache and the IO smoothing it was designed
to perform?
--
Regards,
Dmitry
Dmitry,
You are working at the correct level. The bgwriter performs the I/O
smoothing
function at the database level. Obviously, the OS level smoothing
function
needed to be tuned and you have done that within the parameters of
the OS.
You may want to bring this up on the Linux kernel lists and see if
they have
any ideas.
Good luck,
Ken
Have you tried decreasing you checkpoint interval? That would at
least help to reduce the amount of data that needs to be flushed when
Postgres fsyncs.
Erik Jones
Software Developer | Emma®
erik@xxxxxxxxxx
800.595.4401 or 615.292.5888
615.292.0777 (fax)
Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend