On 10/10/2011 10:14 AM, Leonardo Francalanci wrote:
I don't understand: don't you want postgresql to issue the fsync calls when it "makes sense" (and configure them), rather than having the OS decide when it's best to flush to disk? That is: don't you want all the memory to be used for caching, unless postgresql says otherwise (calling fsync), instead of "as soon as 1% of memory is used"?
You'd think that, which is why this bites so many people. That's not quite how it works in practice, though. OS cache is a lot more than altered database and WAL files, which do get fsync'd frequently. Beyond that, you need to worry about what happens *between* fsync calls.
On a highly loaded database, or even just a database experiencing heavy write volume due to some kind of ETL process, your amount of dirty memory may increase much more quickly than you expect. For example, say your checkpoint_timeout setting is the default of five minutes. An ETL process runs that loads 2GB of data into a table, and you're archiving transaction logs. So you now have three possible write vectors, not including temp tables and what not. And that's only for that connection; this gets more complicated if you have other OLTP connections on the same DB.
So your memory is now flooded with 2-6GB of data, and that's easy for memory to handle, and it can do so quickly. With 48GB of RAM, that's well within caching range, so the OS never writes anything until the fsync call. Then the database makes the fsync call, and suddenly the OS wants to flush 2-6GB of data straight to disk. Without that background trickle, you now have a flood that only the highest-end disk controller or a backing-store full of SSDs or PCIe NVRAM could ever hope to absorb.
That write flood massively degrades your read IOPS, and degrades future writes until it's done flushing, so all of your disk IO is saturated, further worsening the situation. Now you're getting closer and closer to your dirty_ratio setting, at which point the OS will effectively stop responding to anything, so it can finally finish flushing everything to disk. This can take a couple minutes, but it's not uncommon for these IO storms to last over half an hour depending on the quality of the disks/controller in question. During this time, system load is climbing precipitously, and clients are getting query timeouts.
Adding more memory can actually make your system performance worse if you don't equally increase the capability of your RAID/SAN/whatever to compensate for increased size of write chunks.
This is counter-intuitive, but completely borne out by tests. The kernel developers agree, or we wouldn't have dirty_bytes, or dirty_background_bytes, and they wouldn't have changed the defaults to 5% and 10% instead of 10% and 40%. It's just one of those things nobody expected until machines with vast amounts of RAM started becoming common.
-- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@xxxxxxxxx ______________________________________________ See http://www.peak6.com/email-disclaimer/ for terms and conditions related to this email -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance