On Wed, May 27, 2015 at 10:14 PM, Alvaro Herrera <alvherre@xxxxxxxxxxxxxxx> wrote:
> Well I'm not very clear on what's the problematic case.  The scenario I
> actually saw this first reported was a pg_basebackup taken on a very
> large database, so the master could have truncated multixact and the
> standby receives a truncated directory but actually tries to apply a
> checkpoint that is much older than what the master currently has
> transmitted as pg_multixact contents.

OK, that makes sense.

>> That might be an OK fix, but this implementation doesn't seem very
>> clean.  If we're going to remove the invariant that
>> MultiXactState->oldestOffset will always be valid after replaying a
>> checkpoint, then we should be explicit about that and add a flag
>> indicating whether or not it's currently valid.  Shoving nextOffset in
>> there and hoping that's good enough seems like a bad idea to me.
>>
>> I think we should modify the API for find_multixact_start.  Let's have
>> it return a Boolean and return oldestOffset via an out parameter.  If
>> !InRecovery, it will always return true and set the out parameter; but
>> if in recovery, it is allowed to return false without setting the out
>> parameter.  Both values can get stored in MultiXactState, and we can
>> adjust the logic elsewhere to disregard oldestOffset when the
>> accompanying flag is false.
>
> Sounds good.  I think I prefer that multixact creation is rejected
> altogether if the new flag is false.  Is that what you mean when you say
> "adjust the logic"?

No.  I'm not sure quite what you mean here.  We can't reject multixact
creation during normal running, and during recovery, we won't create
any really new multixacts, but we must replay the creation of
multixacts.

What I meant was stuff like this:

    if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
        (MultiXactState->nextOffset - MultiXactState->oldestOffset
            > MULTIXACT_MEMBER_SAFE_THRESHOLD))

I meant that we'd change the second prong of the test to check
MultiXactState->nextOffsetValid && MultiXactState->nextOffset -
MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD.  And
likewise change anything else that relies on oldestOffset.  Or else we
guarantee that we can't reach those points until the oldestOffset is
valid, and then check that it is with an Assert() or elog().

>> This still leaves open an ugly possibility: can we reach normal
>> running without a valid oldestOffset?  If so, until the next
>> checkpoint happens, autovacuum has no clue whether it needs to worry.
>> There's got to be a fix for that, but it escapes me at the moment.
>
> I think the fix to that issue is to set the oldest offset on
> TrimMultiXact.  That way, once WAL replay finished we're certain that we
> have a valid oldest offset to create new multixacts with.
>
> I'm also wondering whether the call to DetermineSafeOldestOffset on
> StartupMultiXact is good.  At that point, we haven't replayed any WAL
> yet, so the oldest multi might be pointing at a file that has already
> been removed -- again considering the pg_basebackup scenario where the
> multixact files are copied much later than pg_control, so the checkpoint
> to replay is old but the pg_multixact contents have already been
> truncated in the master and are copied truncated.

Moving the call from StartupMultiXact() to TrimMultiXact() seems like
a good idea.  I'm not sure why we didn't do that before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
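
For readers following the thread, here is a minimal standalone sketch of the two ideas proposed in this message: a find_multixact_start-style lookup that returns a Boolean and hands back the offset through an out parameter, and a threshold test that disregards oldestOffset while the accompanying flag is false. It is not PostgreSQL source. Every name prefixed with sketch_ or Sketch, the segment_present parameter, the threshold value, and the flag name oldestOffsetValid (the message writes it as nextOffsetValid) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactOffset;

/* Hypothetical stand-in value; not the real MULTIXACT_MEMBER_SAFE_THRESHOLD. */
#define SKETCH_MEMBER_SAFE_THRESHOLD ((MultiXactOffset) 1000)

/* Simplified stand-in for the relevant fields of MultiXactState. */
typedef struct SketchMultiXactState
{
    MultiXactOffset nextOffset;
    MultiXactOffset oldestOffset;
    bool            oldestOffsetValid;  /* hypothetical flag name */
} SketchMultiXactState;

/*
 * Boolean return plus out parameter.  segment_present models whether the
 * file holding the oldest multixact still exists on disk.  Outside
 * recovery the lookup is expected to succeed; during recovery it may
 * return false and leave *result untouched.
 */
bool
sketch_find_multixact_start(bool in_recovery, bool segment_present,
                            MultiXactOffset stored_offset,
                            MultiXactOffset *result)
{
    if (!segment_present)
    {
        /* A missing segment is only tolerated while in recovery. */
        assert(in_recovery);
        return false;
    }

    *result = stored_offset;
    return true;
}

/* Store both the offset and its validity flag in the shared state. */
void
sketch_set_oldest_offset(SketchMultiXactState *state, bool in_recovery,
                         bool segment_present, MultiXactOffset stored_offset)
{
    MultiXactOffset offset = 0;

    state->oldestOffsetValid =
        sketch_find_multixact_start(in_recovery, segment_present,
                                    stored_offset, &offset);
    if (state->oldestOffsetValid)
        state->oldestOffset = offset;
}

/*
 * Second prong of the test: only trust the member-space distance when
 * oldestOffset is known to be valid.  The subtraction is unsigned, so it
 * stays well defined when the offset counter wraps around.
 */
bool
sketch_members_over_threshold(const SketchMultiXactState *state)
{
    return state->oldestOffsetValid &&
        (MultiXactOffset) (state->nextOffset - state->oldestOffset)
            > SKETCH_MEMBER_SAFE_THRESHOLD;
}
```

With the flag false (for example, a standby replaying an old checkpoint against already-truncated pg_multixact files), sketch_members_over_threshold() reports false regardless of the stale oldestOffset; once a lookup succeeds, the distance check applies as before. The alternative mentioned in the message, guaranteeing validity before those code paths can be reached and asserting it, would replace the flag test with an assertion.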