On Tue, Aug 28, 2007 at 10:00:57AM -0500, Erik Jones wrote:
> >> It seemed strange to me that our 70%-read db generates so many
> >> dirty pages that writing them out takes 4-8 seconds and grabs the
> >> full bandwidth. First, I started to tune bgwriter to more
> >> aggressive settings, but this was of no help: nearly no
> >> performance change at all. Digging into the issue further, I
> >> discovered that the Linux page cache was the reason. The "Dirty"
> >> parameter in /proc/meminfo (which shows the amount of
> >> ready-to-write "dirty" data currently sitting in the page cache)
> >> grows between checkpoints from 0 to about 100MB. When a
> >> checkpoint comes, all 100MB gets flushed out to disk,
> >> effectively causing an IO storm.
> >>
> >> I found this document
> >> (http://www.westnet.com/~gsmith/content/linux-pdflush.htm)
> >> and peeked into mm/page-writeback.c in the Linux kernel source
> >> tree. I'm not sure that I understand the pdflush writeout
> >> semantics correctly, but it looks like when the amount of "dirty"
> >> data is less than dirty_background_ratio*RAM/100, pdflush only
> >> writes pages in the background, waking up every
> >> dirty_writeback_centisecs and writing no more than 1024 pages
> >> (the MAX_WRITEBACK_PAGES constant). When we hit
> >> dirty_background_ratio, pdflush starts to write out more
> >> aggressively.
> >>
> >> So, it looks like the following scenario takes place: postgresql
> >> constantly writes something to the database and xlog files, dirty
> >> data gets into the page cache, and is then slowly written out by
> >> pdflush. When postgres generates more dirty pages than pdflush
> >> writes out, the amount of dirty data in the pagecache grows. When
> >> we're at a checkpoint, postgres does fsync() on the database
> >> files, and sleeps until the whole page cache is written out.
> >>
> >> By default, dirty_background_ratio is 2%, which is about 328MB of
> >> 16GB total. Following the current pdflush logic, nearly this
> >> amount of data is what we face writing out at checkpoint,
> >> effectively stalling everything else, so even 1% of 16GB is too
> >> much. My setup experiences a 4-8 sec pause in operation even with
> >> ~100MB of dirty pagecache...
> >>
> >> I temporarily solved this problem by setting
> >> dirty_background_ratio to 0%. This causes the dirty data to be
> >> written out immediately. It is OK for our setup (mostly because
> >> of the large controller cache), but it doesn't look to me like an
> >> elegant solution. Is there some other way to fix this issue
> >> without disabling the pagecache and the IO smoothing it was
> >> designed to perform?
> >
> >You are working at the correct level. The bgwriter performs the I/O
> >smoothing function at the database level. Obviously, the OS level
> >smoothing function needed to be tuned and you have done that within
> >the parameters of the OS. You may want to bring this up on the
> >Linux kernel lists and see if they have any ideas.
> >
> >Good luck,
> >
> >Ken
>
> Have you tried decreasing your checkpoint interval? That would at
> least help to reduce the amount of data that needs to be flushed when
> Postgres fsyncs.

The downside to that is it will result in writing a lot more data to
WAL as long as full page writes are on.

Isn't there some kind of a timeout parameter for how long dirty data
will sit in the cache? It seems pretty broken to me to allow stuff to
sit in a dirty state indefinitely.
-- 
Decibel!, aka Jim Nasby                        decibel@xxxxxxxxxxx
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
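
For anyone who wants to check the arithmetic quoted above on their own box, here is a minimal sketch. It assumes the simplified formula from the quoted message (dirty_background_ratio*RAM/100); the kernel's real calculation is based on the memory it considers available for dirtying and will differ somewhat. The function names are made up for illustration; only the /proc/meminfo fields (MemTotal, Dirty) and the /proc/sys/vm/dirty_background_ratio path are real.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def background_threshold_kb(mem_total_kb, dirty_background_ratio):
    """Simplified threshold from the thread: ratio * RAM / 100 (in kB)."""
    return mem_total_kb * dirty_background_ratio // 100


def report(meminfo_text, dirty_background_ratio):
    """Compare the current Dirty figure against the simplified threshold."""
    info = parse_meminfo(meminfo_text)
    threshold = background_threshold_kb(info["MemTotal"],
                                        dirty_background_ratio)
    return {
        "dirty_kb": info["Dirty"],
        "threshold_kb": threshold,
        "over_threshold": info["Dirty"] > threshold,
    }


def live_report():
    """Read the live values on a Linux host (not called automatically)."""
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    with open("/proc/sys/vm/dirty_background_ratio") as f:
        ratio = int(f.read().strip())
    return report(meminfo_text, ratio)
```

With 16GB of RAM and a ratio of 2 this gives a threshold of roughly 328MB, which matches the figure in the quoted message; with a ratio of 0 the threshold is zero, so any dirty data at all is over it. Calling live_report() between checkpoints should show how close Dirty gets to the threshold on a given system.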