I'm a little late getting back to this but still have no solution. I agree 100% that updating Postgres is best; the usual excuses for not updating apply. It will be done when it is allowed to be done, and it remains 11.4 for now. I read through the changelogs up through 11.18 and didn't see anything obviously related, though the underlying cause could be involved enough that I wouldn't recognize it.

Thank you Laurenz Albe for reminding me about an important clue. I had inadvertently executed a VACUUM FREEZE on a table that pg_stat_activity indicated was currently being autovacuumed. The manual vacuum succeeded while leaving the previous autovacuum still showing as active in pg_stat_activity. A manual VACUUM always completes and never stalls, but it also often does not appear in pg_stat_progress_vacuum unless it runs long enough. It appears the process completes the vacuum but does not register that fact: relfrozenxid of the main table was what I would expect, but an associated TOAST table was still very old.

Cancelling all pending vacuums of the table and manually running VACUUM completes in a few seconds; both the main relation and its TOAST table are then updated as expected, with the last vacuum time updated in pg_stat_all_tables. pg_stat_progress_vacuum never had any entry.

Autovacuum and autovacuum analyze both get hung. I often see the same table listed multiple times in pg_stat_activity with different PIDs and:

    state:           active
    backend_type:    autovacuum worker
    wait_event_type: (blank)
    state_change:    10-15 seconds after backend_start - about how long a manual VACUUM takes on most tables

What exactly does autovacuum rely on in the stats file? I ran strace on a hung autovacuum process and saw repeated reads of /run/postgresql/db_16384.stat (on tmpfs). The file is 740 MB, which is about the same as other similar installations I've reviewed, though I don't have much experience to compare against.

One final oddity: I ran a query for the oldest relfrozenxid and redirected the output to a file. The query took around a minute, yet a few seconds after it finished, a query for 'active' backends in pg_stat_activity still listed it. A few seconds later it had cleared.

Can a corrupted stats file prevent autovacuum from reading or writing it?
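For anyone who wants to reproduce the relfrozenxid check, it was roughly the following (a sketch rather than my exact query; the LIMIT is arbitrary). Joining each relation to its TOAST table through pg_class.reltoastrelid is what makes a stale TOAST relfrozenxid stand out:

    -- sketch: oldest relfrozenxid, main relations next to their TOAST tables
    SELECT c.oid::regclass     AS main_rel,
           age(c.relfrozenxid) AS main_age,
           t.oid::regclass     AS toast_rel,
           age(t.relfrozenxid) AS toast_age
    FROM pg_class c
    LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
    WHERE c.relkind IN ('r', 'm')
    ORDER BY greatest(age(c.relfrozenxid), age(t.relfrozenxid)) DESC NULLS LAST
    LIMIT 20;  -- arbitrary cutoff

The "cancelling all pending vacuums" step was done with something like this ('my_table' is a placeholder; pg_cancel_backend() sends the same cancel as Ctrl-C, and pg_terminate_backend() is the heavier fallback):

    -- list the autovacuum workers touching the table
    SELECT pid, backend_start, state_change, wait_event_type, query
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker'
      AND query LIKE '%my_table%';   -- 'my_table' is a placeholder

    -- cancel them
    SELECT pg_cancel_backend(pid)
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker'
      AND query LIKE '%my_table%';

And the cross-check showing workers 'active' in pg_stat_activity with no corresponding row in pg_stat_progress_vacuum is simply a left join on pid (the hung workers come back with NULL progress columns):

    -- sketch: autovacuum workers vs. their (missing) progress rows
    SELECT a.pid, a.state, a.wait_event_type, p.phase, p.heap_blks_scanned
    FROM pg_stat_activity a
    LEFT JOIN pg_stat_progress_vacuum p USING (pid)
    WHERE a.backend_type = 'autovacuum worker';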
>> I scripted a vacuum loop using the oldest table list. It's extremely slow, but it was
>> making better progress than autovacuum was.
>>
>> Using ps I see that there were as many worker processes as defined by autovacuum_max_workers,
>> but pg_stat_activity consistently showed 19. I killed the script thinking there might be a
>> conflict. I saw no difference after 30 minutes, so I restarted the script.
>
> I am not sure what exactly you are doing here, but you should know that there can only be one
> VACUUM process per table. If there is already an anti-wraparound autovacuum running on the
> table, a manual VACUUM will simply be blocked until the autovacuum worker is done.

>> Never saw anything in pg_stat_progress_vacuum.
>
> Now that would be weird, except if VACUUM cannot get the required lock on the table.

>> vacuum settings:
>>
>>                 name                  |  setting
>>  -------------------------------------+-----------
>>  autovacuum                           | on
>>  autovacuum_freeze_max_age            | 200000000
>>  autovacuum_max_workers               | 40
>>  autovacuum_naptime                   | 4
>>  autovacuum_vacuum_cost_delay         | 0
>>  autovacuum_vacuum_cost_limit         | 5000
>>  autovacuum_work_mem                  | -1
>>  vacuum_freeze_min_age                | 50000000
>>  vacuum_freeze_table_age              | 150000000

>> I'm now thinking that autovacuum getting hung up is what caused the issue to begin with. I see
>> nothing but the successful vacuums from the script and my own fat-fingered commands in the
>> postgres logs (set at info).

Sorry about the reply formatting. I tried using Outlook web in Edge; maybe that was a mistake.

Thanks,
Senor
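P.S. In case it helps anyone reproducing this: Laurenz's point about a manual VACUUM blocking behind an existing autovacuum can be checked directly with pg_blocking_pids() (a sketch; the function exists since 9.6). Given the blank wait_event_type I'm seeing, I'd expect the blocked_by arrays to come back empty:

    -- sketch: is anything actually blocking the autovacuum workers?
    SELECT pid, pg_blocking_pids(pid) AS blocked_by,
           state, wait_event_type, wait_event
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker';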