Autovacuum endless loop in heap_page_prune()?

Peter <pmc@xxxxxxxxxxxxxxxxxxxxxxx> · Sat, 25 May 2024 12:51:50 +0200

Good morning,

 I just found Autovacuum run for 6 hours on a 8 GB table, VACUUM query
doesnt cancel, cluster doesn't stop, autovacuum worker is not
killable, truss shows no activity, after kill -6 this backtrace:

* thread #1, name = 'postgres', stop reason = signal SIGABRT
  * frame #0: 0x0000000000548063 postgres`HeapTupleSatisfiesVacuumHorizon + 531
    frame #1: 0x000000000054aed9 postgres`heap_page_prune + 537
    frame #2: 0x000000000054e38a postgres`heap_vacuum_rel + 3626
    frame #3: 0x00000000006af382 postgres`vacuum_rel + 626
    frame #4: 0x00000000006aeeeb postgres`vacuum + 1611
    frame #5: 0x00000000007b4664 postgres`do_autovacuum + 4292
    frame #6: 0x00000000007b2342 postgres`AutoVacWorkerMain + 866
    frame #7: 0x00000000007b1f97 postgres`StartAutoVacWorker + 39
    frame #8: 0x00000000007ba0df postgres`sigusr1_handler + 783
    frame #9: 0x00000008220da627 libthr.so.3`___lldb_unnamed_symbol683 + 215
    frame #10: 0x00000008220d9b1a libthr.so.3`___lldb_unnamed_symbol664 + 314
    frame #11: 0x00007ffffffff913
    frame #12: 0x00000000007bba25 postgres`ServerLoop + 1541
    frame #13: 0x00000000007b9467 postgres`PostmasterMain + 3207
    frame #14: 0x000000000071a566 postgres`main + 758
    frame #15: 0x00000000004f9995 postgres`_start + 261

After restart, no problems reported yet.

Storyline:
this is the file-list table of my backup/archive system, contains ~50
mio records. Recently I found a flaw in the backup system, so that some
old records weren't removed. I wrote a script to do this, that script
did run first at 04:15 and reported it had now removed a lot of old
data. I looked into pgadmin4 and it reported 9 mio dead tuples.

I wondered why autovacuum wouldn't run, and the stats showed that
regular daily vacuum had last run at 03:15 and autovacuum at 03:18. I
wondered why it wouldn't start again, and went to sleep. Having
consumed 300 minutes cpu now at 11:00, it did start at around 05:00.

No messages of hardware errors, good old Haswell Xeon/EP with all ECC.
Could there be a compute error somewhere, and does autovacuum lock
out ordinary kill signals during some time?

We're at Rel. 15.6 on FreeBSD 13.3

 *** Changed values: 
shared_buffers = 40MB
temp_buffers = 20MB
work_mem = 50MB
maintenance_work_mem = 50MB
max_stack_depth = 40MB
dynamic_shared_memory_type = posix
max_files_per_process = 200
effective_io_concurrency = 5
synchronous_commit = off
wal_sync_method = fsync
full_page_writes = off
wal_compression = on
wal_init_zero = off
wal_writer_delay = 2000ms
checkpoint_timeout = 180min
checkpoint_completion_target = 0.0
max_wal_size = 2GB
archive_mode = on
archive_timeout = 86400
seq_page_cost = 0.5
random_page_cost = 0.7
effective_cache_size = 1GB
default_statistics_target = 1000
autovacuum = on
autovacuum_naptime = 5min
autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.05