On Fri, May 29, 2015 at 9:46 PM, Andres Freund <andres@xxxxxxxxxxx> wrote: > On 2015-05-29 15:08:11 -0400, Robert Haas wrote: >> It seems pretty clear that we can't effectively determine anything >> about member wraparound until the cluster is consistent. > > I wonder if this doesn't actually hints at a bigger problem. Currently, > to determine where we need to truncate SlruScanDirectory() is > used. That, afaics, could actually be a problem during recovery when > we're not consistent. I agree. I actually meant to mention this in my previous email, but, owing to exhaustion and burnout, didn't. > I think at least for 9.5+ we should a) invent proper truncation records > for pg_multixact b) start storing oldestValidMultiOffset in pg_control. > The current hack of scanning the directories to get knowledge we should > have is a pretty bad hack, and we should not continue using it forever. > I think we might end up needing to do a) even in the backbranches. That may be the right thing to do. I'm concerned that changing the behavior of master too much will make it every subsequent fix twice as hard, because we'll have to do one fix in master and another fix in the back-branches. I'm also concerned that it will create even more convoluted failure scenarios. The failure-to-start problem discussed on this thread requires a chain of four (maybe three) different PostgreSQL versions in order to create it, and the more things we go change, the harder it's going to be to reason about this stuff. The diseased and rotting elephant in the room here is that clusters with bogus relminmxid, datminmxid, and/or oldestMultiXid values may exist in the wild and we really have no plan to get rid of them. 78db307bb may have helped somewhat - although I'm haven't grokked what it's about well enough to be sure - but it's certainly not a complete solution, as this bug report itself illustrates rather well. Unless we figure out some clever solution that is not now apparent to me, or impose a hard pg_upgrade compatibility break at some point, we basically can't count on pg_control's "oldest multixact" information to be correct ever again. We may be running into clusters 15 years from now that have problems that are just holdovers from what was fixed in 9.3.5. One thing I think we should definitely do is add one or two additional fields to pg_controldata that get filled in by pg_upgrade. One of them should be "the oldest known catversion in the lineage of this cluster" and the other should be "the most recent catverson in the lineage of this cluster before this one". Or maybe we should store PG_VERSION_NUM values. Or store both things. I think that would make troubleshooting this kind of problem a lot easier - just from the pg_controldata output, you'd be able to tell whether the cluster had been pg_upgraded, whether it had been pg_upgraded once or multiple times, and at least some of the versions involved, without relying on the user's memory of what they did and when. Fortunately, Steve Kellet had a pretty clear idea of what his history was, but not all users know that kind of thing, and I've wanted it more than once while troubleshooting. Another thing I think we should do is add a field to pg_class that is propagated by pg_upgrade and stores the most recent PG_VERSION_NUM that is known to have performed a scan_all vacuum of the table. This would allow us to do things in the future like (a) force a full-table vacuum of any table that hasn't been vacuumed since $BUGGYRELEASE or (b) advise users to manually inspect the values and manually perform said vacuum or (c) only believe that certain information about a table is accurate if it's been full-scanned by a vacuum newer than $BUGGYRELEASE. It could also be used as part of a strategy for reclaiming HEAP_MOVED_IN/HEAP_MOVED_OFF; e.g. you can't upgrade to 10.5, which repurposes those bits, unless you've done a scan_all vacuum on every table with a release new enough to guarantee that they're not used for their historical purpose. > This problem isn't conflicting with most of the fixes you describe, so > I'll continue with reviewing those. Thank you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general