Hello!
We run a large (~66GB) web-backend
database on PostgreSQL 8.2.4 on Linux.
The hardware is a dual Xeon 5130 with 16GB RAM and an LSI MegaRAID
U320-2X SCSI controller with a 512MB write-back cache and a BBU. The
storage setup consists of three RAID 10 arrays (data, xlog and indexes,
each on its own array), 12 HDDs in total. The frontend application uses
the JDBC driver, connection pooling and threads.
We've run into an issue of I/O storms at checkpoints. Once
every 20 minutes (our checkpoint_timeout) the database becomes
unresponsive for about 4-8 seconds. Query processing is suspended and
the server does nothing but write a large amount of data to the disks.
Because the DB server is stalled, some of the web clients hit their
timeout and disconnect, which is unacceptable. Even worse, since new
requests arrive at a fairly constant rate, by the time the storm ends
there is a huge backlog of sleeping application threads waiting for
their queries to complete. When the DB server comes back to life,
these threads wake up and flood it with queries, so performance
suffers even more for some minutes after the checkpoint.
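For reference, these are the knobs that control checkpoint frequency
in 8.2; the query below is roughly how I check them (the database name
is just a placeholder, and the list is what I would look at rather
than our exact configuration):

    psql -d mydb -c "SELECT name, setting FROM pg_settings
                     WHERE name IN ('checkpoint_timeout',
                                    'checkpoint_segments',
                                    'checkpoint_warning');"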
It seemed strange to me that our 70%-read database generates so
many dirty pages that writing them out takes 4-8 seconds and grabs the
full bandwidth. First, I tuned the bgwriter to more aggressive
settings, but this was of no help: nearly no change in performance at
all. Digging into the issue further, I discovered that the Linux page
cache was the reason. The "Dirty" field in /proc/meminfo (which shows
the amount of dirty, ready-to-write data currently sitting in the page
cache) grows between checkpoints from 0 to about 100MB. When the
checkpoint comes, all 100MB gets flushed out to disk, effectively
causing an I/O storm.
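Roughly how I watched both sides of this (commands reproduced from
memory, database name is again a placeholder):

    # the bgwriter settings I had been pushing up
    psql -d mydb -c "SELECT name, setting FROM pg_settings
                     WHERE name LIKE 'bgwriter%';"

    # watch the dirty page cache grow between checkpoints
    while true; do grep '^Dirty:' /proc/meminfo; sleep 5; done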
I found this document (http://www.westnet.com/~gsmith/content/linux-pdflush.htm)
and peeked into mm/page-writeback.c in the Linux kernel source tree.
I'm not sure I understand the pdflush writeout semantics correctly,
but it looks like this: while the amount of dirty data is less than
dirty_background_ratio*RAM/100, pdflush only writes pages in the
background, waking up every dirty_writeback_centisecs and writing no
more than 1024 pages (the MAX_WRITEBACK_PAGES constant) per wakeup.
Once we hit dirty_background_ratio, pdflush starts writing out much
more aggressively.
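For anyone who wants to check their own box, the tunables in question
live under /proc/sys/vm (2.6 kernels):

    cat /proc/sys/vm/dirty_background_ratio
    cat /proc/sys/vm/dirty_ratio
    cat /proc/sys/vm/dirty_writeback_centisecs
    cat /proc/sys/vm/dirty_expire_centisecs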
So, it looks like the following scenario takes place: PostgreSQL
constantly writes something to the database and xlog files, the dirty
data goes into the page cache and is then slowly written out by
pdflush. Whenever Postgres generates dirty pages faster than pdflush
writes them out, the amount of dirty data in the page cache grows. At
checkpoint time, Postgres calls fsync() on the database files and
sleeps until all that accumulated dirty data has been written out.
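If I read the code right, and assuming the usual defaults of 4KB pages
and dirty_writeback_centisecs = 500 (one wakeup every 5 seconds), the
background path tops out at roughly:

    1024 pages * 4 KB/page = 4 MB per pdflush wakeup
    4 MB / 5 s            ~= 0.8 MB/s of background writeout

so anything the database dirties faster than that simply piles up in
the page cache until the next checkpoint, or until
dirty_background_ratio is reached.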
By default, dirty_background_ratio is 2%, which is about
328MB of our 16GB total. Following the current pdflush logic, that is
roughly the amount of data we can be left to write out at checkpoint
time, stalling everything else, so even 1% of 16GB is too much. My
setup experiences a 4-8 second pause in operation even with only
~100MB of dirty page cache...
I temporarily solved this problem by setting
dirty_background_ratio to 0%. This causes dirty data to be written out
immediately. It is OK for our setup (mostly because of the large
controller cache), but it doesn't look like an elegant solution to me.
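For the record, the workaround amounts to something like this (plus
the matching vm.dirty_background_ratio = 0 line in /etc/sysctl.conf so
it survives a reboot):

    echo 0 > /proc/sys/vm/dirty_background_ratio
    # or, equivalently:
    sysctl -w vm.dirty_background_ratio=0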
Is there some other way to fix this issue without disabling the page
cache and the I/O smoothing it was designed to perform?
--
Regards,
Dmitry