Re: autovacuum hung on simple tables

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> · Fri, 04 Nov 2022 09:50:34 +0100

On Fri, 2022-11-04 at 02:50 +0000, senor wrote:
> I'm still trying to get a better understanding of the autovacuum process. 
> This is a different postgres installation as my previous posts and confusing me in new ways.
> Still 11.4 running on CentOS 7 and 8 nvme in software raid

Not good.  That should be 11.17.  You are missing over two years of important bug fixes,
which could be part of the problem.

> This issue started with postgres "...not accepting commands to avoid wraparound...".

That does not happen normally.  Something must have prevented autovacuum from succeeding:

- data corruption that makes VACUUM fail
- a prepared transaction or long running transaction that prevents PostgreSQL from
  cleaning up
- a long running session with a temporary table
- autovacuum is too slow to cope with the rate at which dead tuples are generated

> On this server I was able to stop all access to DB and dedicate resources to only postgres.
> I thought I could allow autovacuum to do its thing with a ton of workers.

If autovacuum is too slow, this will take a long time.  Note that in v11, the default setting
for "autovacuum_vacuum_cost_delay" is still 20ms, which makes autovacuum pretty slow.
Later, you show that you have "autovacuum_vacuum_cost_delay" set to 0, which is good.
Did you change that recently, so that some workers started before that change?

You should also crank up "maintenance_work_mem" for autovacuum to be fast.

> I think everything boils down to 2 questions:
> 1. Can autovacuum or manual vacuum be coerced into dealing with oldest first?

Manual VACUUM is the easiest.  You can start it on the table with the oldest
"relfrozenxid" first.

>     1a. Where might I find advice on configuring postgres resources for maximum cpu &
>         memory maintenance use. In other words quickest path out of "not accepting commands"
>         land. Besides increasing autovacuum_freeze_max_age.

The documentation has a lot about that.
The quickest way is to run VACUUM on the table with the oldest "relfrozenxid" manually.
I would run VACUUM (VERBOSE), so that you get information if it can clean up or
freeze anything.

> 2. What can cause autovacuum to stall? Could associated toast or index bne the cause.

Potentially data corruption could send a backend into an endless loop.

> It appeared that autovacuum was not choosing the tables with the oldest xmin so I
> produced an ordered list of oldest tables with:
> SELECT oid::regclass, age(relfrozenxid)
> FROM pg_class
> WHERE relkind IN ('r', 't', 'm')
> AND age(relfrozenxid) > 2000000000
> ORDER BY 2 DESC
> 
> The list contained over 6000 tables from pg_toast. They all belonged to daily
> reports tables. The reports are created daily and not touched again.
> 
> Most of the autovacuums that did start seem to be hung. Never completing even on the simplest tables. 
> The newest 2 autovacuums in the list are completing about one every couple seconds.
> CPU and disk IO are nearly idle.

To see if they are hung, look at "wait_event", "wait_event_type" and "state" in the
"pg_stat_activity" rows for the autovacuum workers.

High locks on a table have the potential to block an anti-wraparound autovacuum.
Again, check for log running and prepared transactions.

> I scripted a vacuum loop using the oldest table list. It's extremely slow but it was
> making better progress than autovacuum was.
> 
> Using ps I see that there were as many worker processes as defined with autovacuum_max_workers
> but pg_stat_activity consistantly showed 19. I killed the script thinking there might be a conflict.
> I saw no difference after 30 minutes so restarted script.

I am not sure what exactly you are actually doing here, but you should know that there
can only be one VACUUM process per table.  If there is already an anti-wraparound autovacuum
running on the table, a manual VACUUM will simple be blocked until the autovacuum worker
is done.

> Never saw anything in pg_stat_progress_vacuum.

Now that would be weird, except if VACUUM cannot get the required lock on the table.

> vacuum settings:
>                 name                 |  setting  
> -------------------------------------+-----------
>  autovacuum                          | on        
>  autovacuum_freeze_max_age           | 200000000 
>  autovacuum_max_workers              | 40        
>  autovacuum_naptime                  | 4         
>  autovacuum_vacuum_cost_delay        | 0         
>  autovacuum_vacuum_cost_limit        | 5000      
>  autovacuum_work_mem                 | -1        
>  vacuum_freeze_min_age               | 50000000  
>  vacuum_freeze_table_age             | 150000000 
> 
> I'm now thinking that autovacuum getting hung up is what caused the issue to begin with. I see nothing
> but the successful vacuums from the script and my own fat-fingering commands in the postgres
> logs (set at info).

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com