Re: autovacuum hung on simple tables

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> · Mon, 28 Nov 2022 08:53:15 +0100

On Mon, 2022-11-28 at 04:05 +0000, senor wrote:
> I'm a little late getting back to this but still have no solution.
> 
> I had inadvertently executed a vacuum freeze on a table that pg_stat_activity
> indicated was currently being autovacuumed. The manual vacuum succeeded while
> leaving the previous autovacuum still showing active in pg_stat_activity.

No, that cannot be.  VACUUM processes are mutually exclusive.
It could have been another autovacuum worker (different "pid").

> Manual vacuum always completes, never stalls, but also often does not appear
> in pg_stat_progress_vacuum unless it's a longer process.

It *does* appear in "pg_stat_progress_vacuum", but perhaps you are too slow
to catch it.

> It appears the process completes the vacuum but does not register that fact.
> relfrozenxid of the main table is what would be expected but an associated
> toast table was still very old. Cancelling all pending vacuums of the table
> and manually running vacuum completes in a few seconds and both the main
> relation and toast are updated as expected with last vacuum time updated
> in pg_stat_all_tables.

Obviously autovacuum devided to process the main table, but not the TOAST
table.  That is normal.  Manual VACUUM processes both, unless you specify
the option PROCESS_TOAST OFF.

> Autoacuum and autovacuum analyze both get hung.

No, they don't; not unless they are anti-wraparound autovacuum runs and you
are holding high locks on the table in a long running transaction.
They may be slow to complete, because in v11 autovacuum by default is very
slow indeed, as "autovacuum_vacuum_cost_delay" is 20ms.

> I often see the same table listed multiple times in pg_stat_activity with
> different pids and:
> state: active
> backend_type: autovacuum worker
> wait_event_type: blank
> state_change: 10-15 seconds after backend_start - about how long it takes
> to manually run vacuum on most tables.

That can be parallel workers that are used to scan indexes.

> What exactly does autovacuum rely on in the stats file? I ran strace on a
> hung autovacuum process and saw a repeated read of
> /run/postgresql/db_16384.stat (tempfs). The file is 740MB which is about
> the same as other similar installations I've reviewed. I'm lacking in overall
> experience in this though.

I don't know either, and I am too lazy to read the code on that, but I suspect
that it has no connection to your problem.

> One final oddity:
> I ran a query for oldest relfrozenxid and redirected to file. The query took
> around a minute. A few seconds after it finished, I queried for 'active' in
> pg_stat_activity and the oldest relfrozenxid query was still listed.
> A few seconds later it had cleared.

That sounds weird and is hard to believe.
Are the disk or the CPU under extreme stress?

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com