On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake <jd@xxxxxxxxxxxxxxxxx> wrote:
> FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Thanks! I really appreciate the kind words.

So, in thinking through this situation further, it seems to me that
the situation is pretty dire:

1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid
or datminmxid values which are 1 instead of the correct value. Setting
the value to 1 was too far in the past if your MXID counter is < 2B,
and too far in the future if your MXID counter is > 2B.

2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are too new.

3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will
have the first problem for tables in template databases, and the
second one for the rest. (See 866f3017a.)

4. Wrong relminmxid or datminmxid values can eventually propagate into
the control file, as demonstrated in my previous post. Therefore, we
can't count on relminmxid to be correct, we can't count on datminmxid
to be correct, and we can't count on the control file to be correct.
That's a sack of sad.

5. If the values are too far in the past, then nothing really terrible
will happen unless you upgrade to 9.3.7 or 9.4.2, at which point the
system will refuse to start. Forcing a VACUUM FREEZE on every
database, including the unconnectable ones, should fix this and allow
you to upgrade safely - which you want to do, because 9.3.7 and 9.4.2
fix a different set of multixact data loss bugs.

6. If the values are too far in the future, the system may fail to
prevent wraparound, leading to data loss. I am not totally clear on
whether a VACUUM FREEZE will fix this problem. It seems like the
chances are better if you are running at least 9.3.5+ or 9.4.X,
because of 78db307bb. But I'm not sure how complete a fix that is.

So what do we do about this? I have a few ideas:

A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so that
we stop creating new instances of this problem. That won't get us out
of the hole we've dug for ourselves, but we can at least try to stop
digging. (This is assuming I'm right that chkpnt_nxtmulti is the wrong
thing - anyone want to double-check me on that one?)

B. We need to change find_multixact_start() to fail softly. This is
important because it's legitimate for it to fail in recovery, as
discussed upthread, and also because we probably want to eliminate the
fail-to-start hazard introduced in 9.4.2 and 9.3.7.
find_multixact_start() is used in three places, and they each require
separate handling:

- In SetMultiXactIdLimit, find_multixact_start() is used to set
MultiXactState->oldestOffset, which is used to determine how
aggressively to vacuum. If find_multixact_start() fails, we don't
know how aggressively we need to vacuum to prevent members wraparound;
it's probably best to decide to vacuum as aggressively as possible.
Of course, if we're in recovery, we won't vacuum either way; the fact
that it fails softly is good enough.

- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState->offsetStopLimit. If it fails here, we don't know when
to refuse multixact creation to prevent wraparound. Again, in
recovery, that's fine. If it happens in normal running, it's not
clear what to do. Refusing multixact creation is an awfully blunt
instrument. Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point. Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails. It might be more robust than what we
have now.

- In TruncateMultiXact, find_multixact_start() is used to set the
truncation point for the members SLRU. If it fails here, I'm guessing
the right solution is not to truncate anything - instead, rely on
intense vacuuming to eventually advance oldestMXact to a value whose
member data still exists; truncate then.

C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members. As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes. That seems like an
unnecessary risk.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
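
As an aside on point 1 above: the "too far in the past below 2B, too
far in the future above 2B" behavior follows from comparing 32-bit
multixact IDs modulo 2^32. Below is a minimal Python sketch of that
comparison. It assumes the test has the same shape as PostgreSQL's
MultiXactIdPrecedes(), that is, the sign of the 32-bit difference, and
it treats "2B" as 2^31. The helper names (to_int32, precedes,
classify_minmxid) and the sample counter values are hypothetical and
exist only for illustration.

```python
# Sketch only: emulates 32-bit wraparound comparison in plain Python.
# All names and sample values here are hypothetical, for illustration.

MASK32 = 0xFFFFFFFF


def to_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def precedes(a: int, b: int) -> bool:
    """True if a is logically older than b under modulo-2^32 ordering."""
    return to_int32(a - b) < 0


def classify_minmxid(minmxid: int, next_mxid: int) -> str:
    """Describe where a stored minimum sits relative to the next-mxid counter."""
    if minmxid == next_mxid:
        return "equal to next-mxid"
    return "in the past" if precedes(minmxid, next_mxid) else "in the future"


# Illustrative counter values, not taken from any real cluster.
print(classify_minmxid(1, 1_000_000))      # counter below 2^31: in the past
print(classify_minmxid(1, 3_000_000_000))  # counter above 2^31: in the future
```

With a stored value of 1, the first call reports "in the past" and the
second reports "in the future", which matches the two cases in point 1.
A stored value equal to the counter itself corresponds to the "too new"
case in point 2.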