On 2015-03-15 20:54:51 +0300, Ilya Kosmodemiansky wrote:
> On Sun, Mar 15, 2015 at 8:46 PM, Andres Freund <andres@xxxxxxxxxxxxxxx> wrote:
> > That imo doesn't really have anything to do with it. The primary benefit
> > of a BBU with writeback caching is accelerating (near-)synchronous
> > writes. Like the WAL.
>
> My point was, that having no proper raid controller (today bbu surely
> needed for the controller to be a proper one) + heavy writes of any
> kind, it is absolutely impossible to live with large shared_buffers
> and without io problems.

And my point is that you're mostly wrong. What a raid controller's writeback cache usefully accelerates is synchronous writes, i.e. writes that the application waits for. Usually raid controllers don't have much chance to reorder the queued writes (i.e. to turn neighboring writes into one larger sequential write). What they do excel at is making synchronous writes to disk return faster, because the data is only written to the controller's memory, not to actual storage. They're also good at avoiding actual writes to disk when the *same* page is written to multiple times in a short amount of time.

In postgres, writes for data that goes through shared_buffers are usually asynchronous. We write them to the OS's page cache when a page is needed for other contents, is undirtied by the bgwriter, or is written out during a checkpoint; but we do *not* wait for the write to hit the disk. The controller's writeback cache doesn't hugely help here, because it doesn't make *that* much of a difference whether the dirty data stays in the kernel's page cache or in the controller.

In contrast to that, writes to the WAL are often more or less synchronous. We actually wait (via fdatasync()/fsync() syscalls) for writes to hit disk in a bunch of scenarios, most commonly when committing a transaction. Unless synchronous_commit = off, every COMMIT in a transaction that wrote data implies an fdatasync() of a file in pg_xlog (well, we do optimize that in some conditions, but let's leave that out for now).

Additionally, if there are many small/interleaved transactions, we will write()/fdatasync() out the same 8kb WAL page repeatedly. Every time a transaction commits (and at some other points), the page its commit record is on will be flushed. As the WAL records for insertions, updates, deletes and commits are frequently much smaller than 8kb, that will often happen 20-100 times for the same page in OLTP scenarios with narrow rows.

That's why synchronous_commit = off can be such a huge win for OLTP write workloads without a writeback cache - synchronous writes are turned into asynchronous writes, and repetitive writes to the same page are avoided. It also explains why synchronous_commit = off has much less of an effect for bulk write workloads: as there are no synchronous disk writes due to WAL flushes at commit time (there are only very few commits), synchronous commit doesn't have much of an effect. (There's a small example of setting synchronous_commit per session or per transaction near the end of this mail.)

That said, there are a couple of reasons why you're not completely wrong:

Historically, when using ext3 with data=ordered and some other filesystems, synchronous writes to one file forced *all* other previously dirtied data to also be flushed. That means that if you have pg_xlog and the normal relation files on the same filesystem, the synchronous writes to the WAL will not only have to write out the new WAL (often not that much data), but also all the other dirty data. The OS will often be unable to do efficient write combining in that case, because a) there's not yet that much data there, b) postgres doesn't order writes during checkpoints. That means that WAL writes will suddenly have to write out much more data => COMMITs are slow. That's where the suggestion to keep pg_xlog on a separate partition largely comes from.

Writes going through shared_buffers are sometimes indirectly turned into synchronous writes (from the OS's perspective at least, which means they'll be done at a higher priority). That happens when the checkpointer fsync()s all the files at the end of a checkpoint. When things are going well and checkpoints are executed infrequently and completely guided by time (i.e. triggered solely by checkpoint_timeout, and not checkpoint_segments), that's usually not too bad; you'll see a relatively small latency spike for transactions.

Unfortunately the ext* filesystems have an implementation problem here which can make this much worse: the way writes are prioritized during an fsync() can stall out concurrent synchronous reads/writes pretty badly. That's much less of a problem with e.g. xfs, which is why I'd atm not suggest ext4 for write-intensive applications.

The other situation where this can lead to big problems is if your checkpoints aren't scheduled by time. You can recognize that by enabling log_checkpoints and checking a) that time is the trigger, b) that they're actually happening in an interval consistent with checkpoint_timeout (the pg_stat_bgwriter query near the end of this mail is another quick way to check). If the relation files are not written out in a smoothed-out fashion (configured by checkpoint_completion_target), a *lot* of dirty buffers can exist in the OS's page cache. Especially because the default 'dirty' settings in linux on servers with a lot of IO are often completely insane, particularly with older kernels (pretty much everything before 3.11 is badly affected).

The important thing to do here is to configure checkpoint_timeout, checkpoint_segments and checkpoint_completion_target in a consistent way. In my opinion the default checkpoint_timeout is *way* too low and just leads to a large increase in overall writes (with more frequent checkpoints, repeated writes to the same page aren't coalesced) *and* an increase in WAL volume (many more full_page_writes).
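To make that a bit more concrete, here's a rough sketch of the kind of settings I mean, in SQL form (assuming a 9.4 server where ALTER SYSTEM is available; on older releases the same parameters just go into postgresql.conf followed by a reload). The values are purely illustrative and need to be adapted to your write volume and to how long a crash recovery you can tolerate:

    -- spread checkpoints out and smooth the writes between them;
    -- the numbers here are only an example, not a recommendation
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET checkpoint_segments = 100;           -- sized so timeout, not segments, triggers checkpoints
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- write out over ~90% of the interval
    ALTER SYSTEM SET log_checkpoints = on;                -- so you can verify the above actually holds
    SELECT pg_reload_conf();                              -- all of these only need a reload, not a restart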
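To check how checkpoints are actually being triggered, and how much of the data file write traffic goes through the checkpointer, the bgwriter and the backends themselves (the asynchronous write paths mentioned further up), the cumulative counters in pg_stat_bgwriter are a quick, if rough, indicator. A sketch, with the column names as of the 9.x releases:

    -- if checkpoints_req isn't small compared to checkpoints_timed,
    -- checkpoints are being forced by checkpoint_segments, not checkpoint_timeout
    SELECT checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,   -- written out by the checkpointer
           buffers_clean,        -- undirtied by the bgwriter
           buffers_backend       -- written by backends that needed a free buffer
    FROM pg_stat_bgwriter;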
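And coming back to synchronous_commit: it can be changed globally, per session, or even per transaction, so only the traffic that can afford to lose the last few commits after a crash (a short window of lost transactions, not corruption) skips the wait for the WAL flush. A sketch - the table here is obviously made up:

    -- per session
    SET synchronous_commit = off;

    -- or only for a single transaction, leaving the rest of the session unaffected
    BEGIN;
    SET LOCAL synchronous_commit = off;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- hypothetical table
    COMMIT;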
Lots more could be written about this topic; but I think I've blathered on enough for the moment ;)

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services