tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL
files appears to significantly degrade query latency on ZFS. The reason is
complicated and I have details below. Has it been considered to make this
behavior tunable, to cause PostgreSQL to always create new WAL files instead of
re-using old ones?
Context: we're running a large fleet of PostgreSQL shards. Each shard consists
of a primary, a synchronous standby, and an asynchronous standby using chained
replication. For this problem, we can consider only the primary and
synchronous standby.
PostgreSQL: 9.6.3
OS: illumos (SmartOS, mixed versions, but all from 2017 or later)
FS: ZFS over mirrored HDDs (not SSDs), using a record size of 8K to match
PostgreSQL's record size. We have an SSD log device for completing synchronous
writes very quickly.
WAL files are 16MB each, and we're keeping a very large number of segments.
(There's likely a lot of improvement for WAL tuning here.)
Since we're using an 8K record size, when PostgreSQL writes small (or
non-aligned) records to the WAL files, ZFS has to read the old contents in
order to write the new 8K record (read-modify-write). If that's in cache,
that's not a big deal. But when PostgreSQL decides to reuse an old WAL file
whose contents have been evicted from the cache (because they haven't been used
in hours), this turns what should be a workload bottlenecked by synchronous write
performance (that's well-optimized with our SSD log device) into a random read
workload (that's much more expensive for any system).
What's significantly worse is that we saw this on synchronous standbys. When
that happened, the WAL receiver was blocked on a random read from disk, and
since it's single-threaded, all write queries on the primary stop until the
random read finishes. This is particularly bad for us when the sync is doing
other I/O (e.g., for an autovacuum or a database backup) that causes disk reads
to take hundreds of milliseconds.
Reusing old WAL files seems like an optimization intended for filesystems that
allocate disk blocks up front. With copy-on-write, it doesn't seem to make
much sense. If instead of using an old WAL file, PostgreSQL instead just
created a new one, there would be no random reads required to complete these
operations, and we believe we'd completely eliminate our latency outliers.
Thanks,
Dave