On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera <alvherre@xxxxxxxxxxxxxxx> wrote:
>> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
>> values which are equal to the next-mxid counter instead of the correct
>> value; in other words, they are too new.
>
> [ discussion of how the control file's oldestMultiXid gets set ]

I'm talking about the datminmxid in pg_database.  You're talking about
the contents of pg_control.  Those are two different things.  The
relevant code is not what you quote, but rather this:

    /* set pg_database.datminmxid */
    PQclear(executeQueryOrDie(conn_template1,
                              "UPDATE pg_catalog.pg_database "
                              "SET datminmxid = '%u'",
                              old_cluster.controldata.chkpnt_nxtmulti));

Tom previously observed this to be wrong, here:

http://www.postgresql.org/message-id/9879.1405877821@xxxxxxxxxxxxx

Although Tom was correct to note that it's wrong, nothing ever got
fixed. :-(

>> A. Most obviously, we should fix pg_upgrade so that it installs
>> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
>> that we stop creating new instances of this problem.  That won't get
>> us out of the hole we've dug for ourselves, but we can at least try to
>> stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
>> wrong thing - anyone want to double-check me on that one?)
>
> I don't think there's anything that we need to fix here.

I see your followup now agreeing this is broken.  Since I wrote the
previous email, I've had two new ideas that I think are both better
than the above.

1. Figure out the oldest multixact offset that actually exists in
pg_multixact/offsets, and use that value.  If any older MXIDs still
exist, they won't be able to be looked up anyway, so if they wrap
around, it doesn't matter.  The only value that needs to be reliable
in order to do this is pg_controldata's NextMultiXactId, which to the
best of my knowledge is not implicated in any of these bugs.
pg_upgrade can check that the offsets file containing that value
exists, and if not, bail out.  Then, start stepping backwards one file
at a time.  When it hits a missing file, the first multixact in the
next file is a safe value of datminmxid for every database in the new
cluster.  If older MXIDs exist, they're unreadable anyway, so if they
wrap, nothing is lost.  If the value is older than necessary, the
first vacuum in each database will fix it.  We have to be careful: if
we step back too many files, such that our proposed datminmxid might
wrap, then we've got a confusing situation and had better bail out -
or at least think really carefully about what to do.

2. When we're upgrading from a version 9.3 or higher, copy the EXACT
datminmxid from each old database to the corresponding new database.
This seems like it ought to be really simple.

>> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
>> MultiXactState->offsetStopLimit.  If it fails here, we don't know when
>> to refuse multixact creation to prevent wraparound.  Again, in
>> recovery, that's fine.  If it happens in normal running, it's not
>> clear what to do.  Refusing multixact creation is an awfully blunt
>> instrument.  Maybe we can scan pg_multixact/offsets to determine a
>> workable stop limit: the first file greater than the current file that
>> exists, minus two segments, is a good stop point.  Perhaps we ought to
>> use this mechanism here categorically, not just when
>> find_multixact_start() fails.  It might be more robust than what we
>> have now.
>
> Blunt instruments have the desirable property of being simple.  We don't
> want any more clockwork here, I think --- this stuff is pretty
> complicated already.  As far as I understand, if during normal running
> we see that find_multixact_start has failed, sufficient vacuuming should
> get it straight eventually with no loss of data.

Unfortunately, I don't believe that to be true.
If find_multixact_start() fails, we have no idea how close we are to
the member wraparound point.  Sure, we can start vacuuming, but the
user can be creating new, large multixacts at top speed while we're
doing that, which could cause us to wrap around before we can finish
vacuuming.  Furthermore, if we adopted the blunt instrument, users who
are in this situation would update to 9.4.3 (or whenever these fixes
get released) and find that they can't create new MXIDs for a possibly
very protracted period of time.  That amounts to an outage for which
users won't thank us.

Looking at the files in the directory seems pretty simple in this
case, and quite a bit more fail-safe than what we're doing right now.
The current logic purports to leave a one-file gap in the member
space, but there's no guarantee that the gap really exists on disk the
way we think it does.  With this approach, we can be certain that
there is a gap.  And that is a darned good thing to be certain about.

>> C. I think we should also change TruncateMultiXact() to truncate
>> offsets first, and then members.  As things stand, if we truncate
>> members first, we increase the risk of seeing an offset that will fail
>> when passed to find_multixact_start(), because TruncateMultiXact()
>> might get interrupted before it finishes.  That seems like an
>> unnecessary risk.
>
> Not sure about this point.  We did it the way you propose previously,
> and found it to be a problem because sometimes we tried to read an
> offset file that was no longer there.  Do we really read member files
> anywhere?  I thought we only tried to read offset files.  If we remove
> member files, what is it that we try to read and find not to be present?

Do you have a link to the previous discussion?  I mean, the problem
we're having right now is that sometimes we have an offset, but the
corresponding member isn't there.  So clearly offsets reference
members.  Do members also reference offsets?  I didn't think so, but
life is full of surprises.
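For what it's worth, the backward file scan proposed above might look
roughly like this.  This is only a sketch, not PostgreSQL source: the
segment_exists callback is hypothetical (standing in for a stat() of
pg_multixact/offsets/XXXX), the layout constants assume a default
build (BLCKSZ = 8192, 4-byte MultiXactOffset, 32 pages per SLRU
segment), and the wraparound caution discussed earlier is deliberately
ignored here:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Layout constants mirroring PostgreSQL's SLRU defaults (assumed):
 * 8192-byte pages, 4-byte MultiXactOffset entries, 32 pages/segment. */
#define MULTIXACT_OFFSETS_PER_PAGE  (8192 / 4)
#define SLRU_PAGES_PER_SEGMENT      32
#define MXIDS_PER_SEGMENT \
    (MULTIXACT_OFFSETS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)

typedef uint32_t MultiXactId;

/* Hypothetical callback: does offsets segment `segno` exist on disk? */
typedef int (*segment_exists_fn)(uint32_t segno);

/*
 * Starting from the segment that holds next_mxid, step backwards one
 * segment at a time until we hit a missing file.  The first multixact
 * of the oldest segment found to exist is a safe "oldest" value: any
 * older MXIDs are unreadable anyway.
 */
MultiXactId
find_safe_oldest_mxid(MultiXactId next_mxid, segment_exists_fn exists)
{
    uint32_t    segno = next_mxid / MXIDS_PER_SEGMENT;

    /* The segment for NextMultiXactId itself must exist; else bail. */
    if (!exists(segno))
    {
        fprintf(stderr, "offsets segment for NextMultiXactId is missing\n");
        exit(1);
    }

    /* Step back a file at a time until we find the gap. */
    while (segno > 0 && exists(segno - 1))
        segno--;

    return (MultiXactId) (segno * MXIDS_PER_SEGMENT);
}
```

With the default constants each segment covers 65536 multixacts, so if
segments 0003-0005 exist and NextMultiXactId falls in segment 5, the
sketch returns 3 * 65536.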
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general