On Fri, 2022-11-04 at 02:50 +0000, senor wrote: > I'm still trying to get a better understanding of the autovacuum process. > This is a different postgres installation as my previous posts and confusing me in new ways. > Still 11.4 running on CentOS 7 and 8 nvme in software raid Not good. That should be 11.17. You are missing over two years of important bug fixes, which could be part of the problem. > This issue started with postgres "...not accepting commands to avoid wraparound...". That does not happen normally. Something must have prevented autovacuum from succeeding: - data corruption that makes VACUUM fail - a prepared transaction or long running transaction that prevents PostgreSQL from cleaning up - a long running session with a temporary table - autovacuum is too slow to cope with the rate at which dead tuples are generated > On this server I was able to stop all access to DB and dedicate resources to only postgres. > I thought I could allow autovacuum to do its thing with a ton of workers. If autovacuum is too slow, this will take a long time. Note that in v11, the default setting for "autovacuum_vacuum_cost_delay" is still 20ms, which makes autovacuum pretty slow. Later, you show that you have "autovacuum_vacuum_cost_delay" set to 0, which is good. Did you change that recently, so that some workers started before that change? You should also crank up "maintenance_work_mem" for autovacuum to be fast. > I think everything boils down to 2 questions: > 1. Can autovacuum or manual vacuum be coerced into dealing with oldest first? Manual VACUUM is the easiest. You can start it on the table with the oldest "relfrozenxid" first. > 1a. Where might I find advice on configuring postgres resources for maximum cpu & > memory maintenance use. In other words quickest path out of "not accepting commands" > land. Besides increasing autovacuum_freeze_max_age. The documentation has a lot about that. The quickest way is to run VACUUM on the table with the oldest "relfrozenxid" manually. I would run VACUUM (VERBOSE), so that you get information if it can clean up or freeze anything. > 2. What can cause autovacuum to stall? Could associated toast or index bne the cause. Potentially data corruption could send a backend into an endless loop. > It appeared that autovacuum was not choosing the tables with the oldest xmin so I > produced an ordered list of oldest tables with: > SELECT oid::regclass, age(relfrozenxid) > FROM pg_class > WHERE relkind IN ('r', 't', 'm') > AND age(relfrozenxid) > 2000000000 > ORDER BY 2 DESC > > The list contained over 6000 tables from pg_toast. They all belonged to daily > reports tables. The reports are created daily and not touched again. > > Most of the autovacuums that did start seem to be hung. Never completing even on the simplest tables. > The newest 2 autovacuums in the list are completing about one every couple seconds. > CPU and disk IO are nearly idle. To see if they are hung, look at "wait_event", "wait_event_type" and "state" in the "pg_stat_activity" rows for the autovacuum workers. High locks on a table have the potential to block an anti-wraparound autovacuum. Again, check for log running and prepared transactions. > I scripted a vacuum loop using the oldest table list. It's extremely slow but it was > making better progress than autovacuum was. > > Using ps I see that there were as many worker processes as defined with autovacuum_max_workers > but pg_stat_activity consistantly showed 19. I killed the script thinking there might be a conflict. > I saw no difference after 30 minutes so restarted script. I am not sure what exactly you are actually doing here, but you should know that there can only be one VACUUM process per table. If there is already an anti-wraparound autovacuum running on the table, a manual VACUUM will simple be blocked until the autovacuum worker is done. > Never saw anything in pg_stat_progress_vacuum. Now that would be weird, except if VACUUM cannot get the required lock on the table. > vacuum settings: > name | setting > -------------------------------------+----------- > autovacuum | on > autovacuum_freeze_max_age | 200000000 > autovacuum_max_workers | 40 > autovacuum_naptime | 4 > autovacuum_vacuum_cost_delay | 0 > autovacuum_vacuum_cost_limit | 5000 > autovacuum_work_mem | -1 > vacuum_freeze_min_age | 50000000 > vacuum_freeze_table_age | 150000000 > > I'm now thinking that autovacuum getting hung up is what caused the issue to begin with. I see nothing > but the successful vacuums from the script and my own fat-fingering commands in the postgres > logs (set at info). Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com