Hi, On 2019-07-17 13:27:23 -0400, Tom Lane wrote: > My concern here is that if we have blinders on to the extent of only > processing that one table or DB, we're unnecessarily allowing bloat to > occur in other tables, and causing that missed vacuuming work to pile > up so that there's more of it to be done once the breakage is cleared. That a pretty common problem in the real world, with or without problems like corrupted indexes. Autovacuum's scheduling is just too simplistic to avoid that. Common problems: - If one table in one database has an xmin older than autovacuum_freeze_max_age, the autovacuum launcher continuously throws more workers at that database. Even though there's a lot of work to be done in other databases. That's because do_start_worker() ignores the "adl_next_worker" mechanism, and *always* starts more workers for the database with the oldest datfrozenxid (same with datminmxid), and because we explicitly don't start workers for other databases ("/* ignore not-at-risk DBs */"). That's especially bad if that database has a longrunning transaction preventing cleanup from happening - the vacuuming will not be able to advance the horizon, and all others are starved. - If there's a table "early in pg_class", and vacuum on it failes, it will prevent vacuuming other tables. It's pretty common for vacuum to fail often for a table, e.g. because it's sometimes exclusively locked, which then causes autovacuum to kill itself. There's absolutely no mechanism for autovacuum workers to skip over that table for a while, leading to all other tables in the database not being vacuumed, unless there happens to be second worker in the database, while the first vacuum hasn't failed. This obviously also applies to the corrupted index case. The 'kill itself' behaviour is exascerbated by lazy_truncate_heap()'s exclusive lock - which will obviously trigger other backend to send cancellation requests. There's unfortunately a number of places where that leads us to just throw all the work done away, and not update pg_class.relfrozenxid/datfrozenxid - Anti-wraparound vacuums are more impactful (don't cancel themselves upon lock conflicts, cause more IO, wait for cleanup locks), often emit scary messages ("oldest xmin is far in the past"). But we don't have *any* mechanism that avoids them for very common scenarios. E.g. for insert-mostly workloads, there'll never be enough dead tuples to cause a vacuum to happen before autovacuum_freeze_max_age is reached. That prevents vacuum_freeze_table_age from pre-empting the need to do an anti-wraparound vacuum, by increasing the xid horizon. We literally call anti-wraparound autovacuums "emergency" in the code, yet they're an almost inevitablepart of running postgres. - There's no meaningful in-database prioritization. Which means that autovacuum workers might happily vacuum the table just a bit over the thresholds, even though there's much worse tables around. Especially on a busy and large databases that can lead to anti-wraparound started launchers effectively never getting to vacuuming tables above autovacuum_freeze_max_age, because tables earlier in pg_class are modified heavily enough that they have dead tuples above the thresholds by the time vacuum finishes. To get to the anti-wraparound vacuum needing table, a single launcher needs to go through all tables preceding the table in pg_class that need vacuuming (only skipping over ones that are concurrently vacuumed by somebody else, but not ones that have *recently* been vacuumed). I kinda forgot how bad this one was until looking at the code again. And there are plenty more. My impression is that these are really hard to fix unless we develop a new scheduling approach. And that scheduling approach probably needs to be more stateful than the current code. IMO these are the main concerns for how work needs to be distributed: - work in different databases needs to be scheduled in a smarter way, in particular anti-wraparound cannot simply cause only the database with the to oldest datfrozenxid to be vacuumed until the wraparound "danger" is over (especially not if there's other database needing to be anti-wrap vacuumed) - tables within a database need to be prioritized in a smarter way, so databases with a lot of bloat get vacuumed before ones with a lot less bloat, and similarly tables with the oldest horizon need to be vacuumed before ones with newer horizons, even if all of the tables are above the autovacuum thresholds. - tables need to be prioritized across databases, to avoid problems like one currently vacuumed table causing unnecessary anti-wraparound workers to be launched for a database where they can't perform any work. - there needs to be a separate rung between normal autovacuums and anti-wraparound vacuums. The inbetween level does 'aggressive' vacuuming (so it freezes), even for tables with just inserts, but keeps interruptible, and doesn't have a scary 'anti-wraparound' name. - Repeated failures on one table may not starve all work for other tables in a database. - errors vacuuming one table may not cause vacuuming on tables with a lower priority from happening until that error is fixed. To me that means that we need prioritization across databases, and between tables, and probably by multiple criteria. I suspect there need to be multiple criteria how urgent vacuuming is, and autovacuum ought to try to make progress on all of them. I suspect the way to implement that would be to have avlauncher start workers for all databases independent of tasks needing to be done (but at a slower schedule than autovacuum_naptime). Those workers then would put the tables needing vacuuming in the database into the global prioritized list, displacing entries that are less important once the list is full. Proper autovacuuming work would then be done strictly in priority order. To avoid one priority criteria (e.g. age(relfrozenxid) from preventing any work done on other criteria (e.g. number of dead tuples), I suspect we ought to round robin between different criterias. E.g. have one list of tables needing vacuum ordered by age(relfrozenxid), one by %dead-tuples, and one by last-failure time, and have workers try to make progress on all. If that sounds familiar to what is being argued about around UNDO processing in workers, that'd not be an accident. A lesser, but still pretty pressing concern, is that we currently have no system that maintains the visibilitymap readily for insert-mostly tables. That means that a) index-only scans aren't going to happen for such tables, unless they're manually vacuumed b) the anti-wraparound vacuum that eventually is going to happen, is going to be extremely expensive/slow. Regards, Andres