The
error is not actually showing up very often (I have 8 occurrences from
11/29 and none since then). So maybe I should not be concerned about
it. I suspect
we have an I/O bottleneck from other logs (i.e. long checkpoint sync
times), so this error may be a symptom rather than the cause.
I
think that at the point it is getting cancelled, it has done all the
work except the truncation of the empty pages, and reporting the results
(for example, updating n_live_tup and n_dead_tup). If this happens
every single time (neither last_autovacuum nor last_vacuum ever
advances) it will eventually cause problems. So this is mostly a
symptom, but not entirely. Simply running a manual vacuum should fix
the reporting problem. It is not subject to cancelling, so it will
detect it is blocking someone and gracefully bow. Meaning it will
suspend the truncation, but will still report its results as normal.
Reading
the table backwards in order to truncate it might be contributing to
the IO problems as well as being a victim of those problems. Upgrading
to v10 might help with this, as it implemented a prefetch where it reads
the table forward in 128kB chunks, and then jumps backwards one chunk
at a time. Rather than just reading backwards 8kB at a time.
Cheers,
Jeff