On 2015-03-15 20:54:51 +0300, Ilya Kosmodemiansky wrote:
> On Sun, Mar 15, 2015 at 8:46 PM, Andres Freund <andres@xxxxxxxxxxxxxxx> wrote:
> > That imo doesn't really have anything to do with it. The primary benefit
> > of a BBU with writeback caching is accelerating (near-)synchronous
> > writes. Like the WAL.
>
> My point was, that having no proper raid controller (today bbu surely
> needed for the controller to be a proper one) + heavy writes of any
> kind, it is absolutely impossible to live with large shared_buffers
> and without io problems.

And my point is that you're mostly wrong. What a raid controller's writeback cache usefully accelerates is synchronous writes, i.e. writes that the application waits for. Usually raid controllers don't have much chance to reorder the queued writes (i.e. to turn neighboring writes into one larger sequential write). What they do excel at is making synchronous writes to disk return faster, because the data is only written to the controller's memory, not to actual storage. They're also good at avoiding actual writes to disk when the *same* page is written to multiple times in a short amount of time.

In postgres, writes for data that goes through shared_buffers are usually asynchronous. We write them to the OS's page cache when a page is needed for other contents, is undirtied by the bgwriter, or is written out during a checkpoint; but we do *not* wait for the write to hit the disk. The controller's writeback cache doesn't hugely help here, because it doesn't make *that* much of a difference whether the dirty data stays in the kernel's page cache or in the controller.

In contrast to that, writes to the WAL are often more or less synchronous. We actually wait (via fdatasync()/fsync() syscalls) for writes to hit disk in a bunch of scenarios, most commonly when committing a transaction. Unless synchronous_commit = off, every COMMIT in a transaction that wrote data implies an fdatasync() of a file in pg_xlog (well, we do optimize that in some conditions, but let's leave that out for now).

Additionally, if there are many small/interleaved transactions, we will write()/fdatasync() out the same 8kb WAL page repeatedly. Every time a transaction commits (and at some other points), the page its commit record is on will be flushed. As the WAL records for insertions, updates, deletes and commits are frequently much smaller than 8kb, that will often happen 20-100 times for the same page in OLTP scenarios with narrow rows.

That's why synchronous_commit = off can be such a huge win for OLTP write workloads without a writeback cache - synchronous writes are turned into asynchronous writes, and repetitive writes to the same page are avoided. It also explains why synchronous_commit = off has much less of an effect for bulk write workloads: as there are no synchronous disk writes due to WAL flushes at commit time (there are only very few commits), synchronous commit doesn't have much of an effect. (There's a small example of setting synchronous_commit per session or per transaction near the end of this mail.)

That said, there are a couple of reasons why you're not completely wrong:

Historically, when using ext3 with data=ordered and some other filesystems, synchronous writes to one file forced *all* other previously dirtied data to also be flushed. That means that if you have pg_xlog and the normal relation files on the same filesystem, the synchronous writes to the WAL will not only have to write out the new WAL (often not that much data), but also all the other dirty data. The OS will often be unable to do efficient write combining in that case, because a) there's not yet that much data there, b) postgres doesn't order writes during checkpoints. That means that WAL writes will suddenly have to write out much more data => COMMITs are slow. That's where the suggestion to keep pg_xlog on a separate partition largely comes from.

Writes going through shared_buffers are sometimes indirectly turned into synchronous writes (from the OS's perspective at least, which means they'll be done at a higher priority). That happens when the checkpointer fsync()s all the files at the end of a checkpoint. When things are going well and checkpoints are executed infrequently and completely guided by time (i.e. triggered solely by checkpoint_timeout, and not checkpoint_segments), that's usually not too bad; you'll see a relatively small latency spike for transactions.

Unfortunately the ext* filesystems have an implementation problem here which can make this much worse: the way writes are prioritized during an fsync() can stall out concurrent synchronous reads/writes pretty badly. That's much less of a problem with e.g. xfs, which is why I'd atm not suggest ext4 for write-intensive applications.

The other situation where this can lead to big problems is if your checkpoints aren't scheduled by time. You can recognize that by enabling log_checkpoints and checking a) that time is the trigger, b) that they're actually happening in an interval consistent with checkpoint_timeout (the pg_stat_bgwriter query near the end of this mail is another quick way to check). If the relation files are not written out in a smoothed-out fashion (configured by checkpoint_completion_target), a *lot* of dirty buffers can exist in the OS's page cache. Especially because the default 'dirty' settings in linux on servers with a lot of IO are often completely insane, particularly with older kernels (pretty much everything before 3.11 is badly affected).

The important thing to do here is to configure checkpoint_timeout, checkpoint_segments and checkpoint_completion_target in a consistent way. In my opinion the default checkpoint_timeout is *way* too low and just leads to a large increase in overall writes (with more frequent checkpoints, repeated writes to the same page aren't coalesced) *and* an increase in WAL volume (many more full_page_writes).
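To make that a bit more concrete, here's a rough sketch of the kind of settings I mean, in SQL form (assuming a 9.4 server where ALTER SYSTEM is available; on older releases the same parameters just go into postgresql.conf followed by a reload). The values are purely illustrative and need to be adapted to your write volume and to how long a crash recovery you can tolerate:

    -- spread checkpoints out and smooth the writes between them;
    -- the numbers here are only an example, not a recommendation
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET checkpoint_segments = 100;           -- sized so timeout, not segments, triggers checkpoints
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- write out over ~90% of the interval
    ALTER SYSTEM SET log_checkpoints = on;                -- so you can verify the above actually holds
    SELECT pg_reload_conf();                              -- all of these only need a reload, not a restart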
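To check how checkpoints are actually being triggered, and how much of the data file write traffic goes through the checkpointer, the bgwriter and the backends themselves (the asynchronous write paths mentioned further up), the cumulative counters in pg_stat_bgwriter are a quick, if rough, indicator. A sketch, with the column names as of the 9.x releases:

    -- if checkpoints_req isn't small compared to checkpoints_timed,
    -- checkpoints are being forced by checkpoint_segments, not checkpoint_timeout
    SELECT checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,   -- written out by the checkpointer
           buffers_clean,        -- undirtied by the bgwriter
           buffers_backend       -- written by backends that needed a free buffer
    FROM pg_stat_bgwriter;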
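And coming back to synchronous_commit: it can be changed globally, per session, or even per transaction, so only the traffic that can afford to lose the last few commits after a crash (a short window of lost transactions, not corruption) skips the wait for the WAL flush. A sketch - the table here is obviously made up:

    -- per session
    SET synchronous_commit = off;

    -- or only for a single transaction, leaving the rest of the session unaffected
    BEGIN;
    SET LOCAL synchronous_commit = off;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- hypothetical table
    COMMIT;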
Lots more could be written about this topic; but I think I've blathered on enough for the moment ;)

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services