autovacuum hung on simple tables

senor <frio_cervesa@xxxxxxxxxxx> · Fri, 4 Nov 2022 02:50:26 +0000

Hi All,

I'm still trying to get a better understanding of the autovacuum process. 
This is a different postgres installation than the one in my previous posts, and it is confusing me in new ways.
Still 11.4, running on CentOS 7 with 8 NVMe drives in software RAID.

This issue started with postgres "...not accepting commands to avoid wraparound...".
On this server I was able to stop all access to the DB and dedicate resources to postgres alone. I thought I could let autovacuum do its thing with a ton of workers.

I think everything boils down to 2 questions:
1. Can autovacuum or manual vacuum be coerced into dealing with oldest first?
    1a. Where might I find advice on configuring postgres to use as much CPU and memory as possible for maintenance? In other words, what is the quickest path out of "not accepting commands" land, besides increasing autovacuum_freeze_max_age?
2. What can cause autovacuum to stall? Could an associated toast table or index be the cause?

It appeared that autovacuum was not choosing the tables with the oldest xmin so I produced an ordered list of oldest tables with:
SELECT oid::regclass, age(relfrozenxid)
FROM pg_class
WHERE relkind IN ('r', 't', 'm')
AND age(relfrozenxid) > 2000000000
ORDER BY 2 DESC;

The list contained over 6000 tables from pg_toast. They all belonged to daily reports tables. The reports are created daily and not touched again.
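One way to tie each old toast table back to its owning table is pg_class.reltoastrelid. This is only a sketch of that lookup, reusing the same age cutoff as the query above; the column aliases are made up for illustration.

```sql
-- Map each toast table past the age cutoff to the table that owns it.
SELECT c.oid::regclass     AS parent_table,
       t.oid::regclass     AS toast_table,
       age(t.relfrozenxid) AS toast_xid_age
FROM pg_class t
JOIN pg_class c ON c.reltoastrelid = t.oid   -- parent rows point at their toast table
WHERE t.relkind = 't'
  AND age(t.relfrozenxid) > 2000000000
ORDER BY toast_xid_age DESC;
```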

Most of the autovacuums that did start seem to be hung, never completing even on the simplest tables.
The 2 newest autovacuum workers in the list are completing about one table every couple of seconds.
CPU and disk IO are nearly idle.
An example table is shown here:

phantom=# select
phantom-#       pg_size_pretty(pg_total_relation_size(relid)) as total_size,
phantom-#       pg_size_pretty(pg_relation_size(relid, 'main')) as relation_size_main,
phantom-#       pg_size_pretty(pg_relation_size(relid, 'fsm')) as relation_size_fsm,
phantom-#       pg_size_pretty(pg_relation_size(relid, 'vm')) as relation_size_vm,
phantom-#       pg_size_pretty(pg_relation_size(relid, 'init')) as relation_size_init,
phantom-#       pg_size_pretty(pg_table_size(relid)) as table_size,
phantom-#       pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as external_size
phantom-#  from
phantom-#       pg_catalog.pg_statio_user_tables
phantom-# where
phantom-#   relname like 'report_user_439';
 total_size | relation_size_main | relation_size_fsm | relation_size_vm | relation_size_init | table_size | external_size
------------+--------------------+-------------------+------------------+--------------------+------------+---------------
 80 kB      | 8192 bytes         | 24 kB             | 8192 bytes       | 0 bytes            | 48 kB      | 72 kB
(1 row)

I scripted a vacuum loop using the oldest table list. It's extremely slow but it was making better progress than autovacuum was.
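A loop of this kind can be sketched in psql alone with \gexec, which runs each row of a query result as its own statement. This is not the actual script; the plain VACUUM with no options is an assumption, and it reuses the age cutoff from the list query.

```sql
-- Build one VACUUM statement per table, oldest relfrozenxid first.
-- Note there is no trailing semicolon: \gexec sends the query and then
-- executes every returned row as a separate command.
SELECT format('VACUUM %s', oid::regclass)
FROM pg_class
WHERE relkind IN ('r', 't', 'm')
  AND age(relfrozenxid) > 2000000000
ORDER BY age(relfrozenxid) DESC
\gexec
```

Because each VACUUM is sent on its own, a failure on one table does not stop the rest, and the list can be regenerated by rerunning the query.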

Using ps, I saw as many worker processes as autovacuum_max_workers allows (40), but pg_stat_activity consistently showed only 19. I killed the script thinking there might be a conflict. I saw no difference after 30 minutes, so I restarted the script. I never saw anything in pg_stat_progress_vacuum.
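For anyone wanting to compare, checks along these lines show the workers and their wait events. It is a sketch using the standard statistics views in 10 and later, not necessarily the exact queries behind the numbers above.

```sql
-- Autovacuum workers as the server reports them, oldest transaction first.
SELECT pid, state, wait_event_type, wait_event, xact_start, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
ORDER BY xact_start;

-- One row per backend currently running VACUUM, autovacuum workers included.
-- relid::regclass only resolves names for tables in the current database.
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
```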

vacuum settings:
                name                 |  setting  
-------------------------------------+-----------
 autovacuum                          | on        
 autovacuum_analyze_scale_factor     | 0.1       
 autovacuum_analyze_threshold        | 50        
 autovacuum_freeze_max_age           | 200000000 
 autovacuum_max_workers              | 40        
 autovacuum_multixact_freeze_max_age | 400000000 
 autovacuum_naptime                  | 4         
 autovacuum_vacuum_cost_delay        | 0         
 autovacuum_vacuum_cost_limit        | 5000      
 autovacuum_vacuum_scale_factor      | 0.2       
 autovacuum_vacuum_threshold         | 50        
 autovacuum_work_mem                 | -1        
 log_autovacuum_min_duration         | 0         
 vacuum_cleanup_index_scale_factor   | 0.1       
 vacuum_cost_delay                   | 0         
 vacuum_cost_limit                   | 200       
 vacuum_cost_page_dirty              | 20        
 vacuum_cost_page_hit                | 1         
 vacuum_cost_page_miss               | 10        
 vacuum_defer_cleanup_age            | 0         
 vacuum_freeze_min_age               | 50000000  
 vacuum_freeze_table_age             | 150000000 
 vacuum_multixact_freeze_min_age     | 5000000   
 vacuum_multixact_freeze_table_age   | 150000000 
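A listing like the one above can be pulled from pg_settings. The name filter here is a guess at a pattern that matches these parameters, not necessarily the one used to produce the output.

```sql
-- Every parameter with "vacuum" in its name, with its current value.
SELECT name, setting
FROM pg_settings
WHERE name LIKE '%vacuum%'
ORDER BY name;
```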

I'm now thinking that autovacuum getting hung up is what caused the issue to begin with. I see nothing but the successful vacuums from the script and my own fat-fingering commands in the postgres logs (set at info).

Any hints are appreciated.
Senor