On Tue, Apr 17, 2018 at 11:57 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Alvaro Herrera <alvherre@xxxxxxxxxxxxxx> writes:
>> David Pacheco wrote:
>>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL
>>> files appears to significantly degrade query latency on ZFS. The reason is
>>> complicated and I have details below. Has it been considered to make this
>>> behavior tunable, to cause PostgreSQL to always create new WAL files
>>> instead of re-using old ones?
>
>> I don't think this has ever been proposed, because there was no use case
>> for it. Maybe you want to work on a patch for it?
>
> I think possibly the OP doesn't understand why it's designed that way.
> The point is not really to "recycle old WAL files", it's to avoid having
> disk space allocation occur during the critical section where we must
> PANIC on failure. Now, of course, that doesn't really work if the
> filesystem is COW underneath, because it's allocating fresh disk space
> anyway even though semantically we're overwriting existing data.
> But what I'd like to see is a fix that deals with that somehow, rather
> than continue to accept the possibility of ENOSPC occurring inside WAL
> writes on these file systems. I have no idea what such a fix would
> look like :-(
I think I do understand, but as you've observed, recycling WAL files to avoid allocation depends on an implementation detail of the filesystem -- one that I'd expect not to hold for any copy-on-write filesystem. On such systems there may simply be no way to avoid ENOSPC inside those critical sections. (And that's not necessarily such a big deal -- to paraphrase a colleague, ensuring that the system doesn't run out of space is not a particularly surprising or heavy burden to place on the operator. It's great that PostgreSQL can survive that event more gracefully on some systems, but the associated tradeoffs may not be worthwhile for everybody.)

Given that, it seems worthwhile to offer the operator an option to take on the risk that the database crashes if it runs out of space (assuming the result is a crash rather than data corruption), in exchange for a potentially tremendous improvement in tail latency and overall throughput.
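To make that concrete, here is a minimal sketch of the kind of knob I have in mind -- a boolean setting, consulted at the point where an old segment would otherwise be renamed into place as a future one. The names here (wal_recycle, dispose_of_old_segment) are purely illustrative; this isn't a patch against xlog.c, just the shape of the decision:

    /*
     * Hypothetical sketch only -- not actual PostgreSQL code.
     * "wal_recycle" stands in for an imagined boolean GUC whose default
     * preserves today's behavior.
     */
    #include <stdbool.h>
    #include <stdio.h>      /* rename(), perror() */
    #include <unistd.h>     /* unlink() */

    static bool wal_recycle = true;

    static void
    dispose_of_old_segment(const char *oldpath, const char *futurepath)
    {
        if (wal_recycle)
        {
            /*
             * Today's behavior: rename the old segment into place as a
             * future one, so that on a non-COW filesystem its space is
             * already allocated and later WAL writes can't fail with
             * ENOSPC inside a critical section.
             */
            if (rename(oldpath, futurepath) != 0)
                perror("rename");
        }
        else
        {
            /*
             * Proposed option: just remove it and let new segments be
             * created from scratch.  On a COW filesystem the rename buys
             * no allocation guarantee anyway, and overwriting the old
             * file later forces reads of its stale blocks.
             */
            if (unlink(oldpath) != 0)
                perror("unlink");
        }
    }

The default would keep the current behavior; operators running on ZFS or another COW filesystem could turn it off and explicitly accept the ENOSPC risk you describe.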
To quantify the latency impact: in a recent incident, transaction latency on the primary degraded about 2-3x (from a p90 of about 45ms to upwards of 120ms, with outliers exceeding 1s). Over 95% of the outliers above 1s spent over 90% of their time blocked on synchronous replication (based on tracing with DTrace). On the synchronous standby, almost 10% of the WAL receiver's wall-clock time was spent blocked on disk reads in the read-modify-write path for recycled WAL files. The rest of the time the standby was essentially idle -- there was plenty of headroom in the other dimensions (CPU, synchronous write performance).
Thanks,
Dave