Hello,
We've been struggling with some major performance issues related to autovacuum.
(I know this is a common problem.) For a while, we believed this was primarily
related to I/O contention, but recent observations from our monitoring make me
wonder if there's a deeper issue here, so I'm looking for some better
understanding.
From reading the 9.6.3 source, it looks like the autovacuum process itself is
single-threaded, and it reads pages essentially linearly from the relation
(possibly skipping some). When the autovacuum process needs to modify a page,
it doesn't write it directly, but rather marks the buffer dirty. The page will
be written later, either by the checkpointer (according to its configuration, in
terms of time and WAL), the bgwriter (according to its configuration, in terms
of write count and sleep time), or else some other backend process that requires
a free buffer (if the count of non-dirty buffers reaches zero). Is this
accurate?
In our case, we found that when autovacuum runs, the number of dirty buffers
written by regular backends shoots from a handful to as much as 300 buffers per
second. (We have 200 backends on most databases.) More specifically: prior to
recent autovacuums starting, databases did under 1000 buffer allocations per
second. Most of those were done by the checkpointer. Individual backends did
just a handful. After autovacuum started, buffer allocations rose to between
2,000 and 8,000 per second. The checkpointer handled many of these, but so did
individual backends (see above). The bgwriter peaked around 25 buffer writes
per second. So it seems like the spike in buffers written by normal backends
could explain the significant degradation that we see in average latency and
overall throughput (which can be as much as 50%).
It looks to me like the autovacuum process is effectively generating work (in
the form of async writes) that's being distributed implicitly to the various
backend processes, creating latency for any other query that happens to require
a buffer (including read-only queries). Maybe we can improve this by tuning the
bgwriter. But if that's single-threaded, presumably there's an upper bound to
how many buffer writes it can handle? Is there anything to guarantee that the
bgwriter will keep up with the work that's being generated? That is, for some
workloads, wouldn't it be possible that autovacuum could always generate work
faster than the bgwriter can do it, and you'd always have some pretty major
degradation to all other queries?
I've drawn a lot of inferences here, and I'm sure there are some mistakes. I'd
appreciate any clarification, correction, or confirmation!
Thanks in advance,
Dave