Robert Haas wrote: > On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera > <alvherre@xxxxxxxxxxxxxxx> wrote: > > Steve Kehlet wrote: > >> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we > >> just dropped new binaries in place) but it wouldn't start up. I found this > >> in the logs: > >> > >> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: > >> database system was shut down at 2015-05-27 13:12:55 PDT > >> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL: the database system is > >> starting up > >> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL: could not access status of > >> transaction 1 > > > > I am debugging today a problem currently that looks very similar to > > this. AFAICT the problem is that WAL replay of an online checkpoint in > > which multixact files are removed fails because replay tries to read a > > file that has already been removed. > > Hmm, so what exactly is the sequence of events here? It's possible > that I'm not thinking clearly just now, but it seems to me that if > we're replaying the same checkpoint we replayed previously, the offset > of the oldest multixact will be the first file that we didn't remove. Well I'm not very clear on what's the problematic case. The scenario I actually saw this first reported was a pg_basebackup taken on a very large database, so the master could have truncated multixact and the standby receives a truncated directory but actually tries to apply a checkpoint that is much older than what the master currently has transmitted as pg_multixact contents. > > I think the fix to this is to verify whether the file exists on disk > > before reading it; if it doesn't, assume the truncation has already > > happened and that it's not necessary to remove it. > > That might be an OK fix, but this implementation doesn't seem very > clean. If we're going to remove the invariant that > MultiXactState->oldestOffset will always be valid after replaying a > checkpoint, then we should be explicit about that and add a flag > indicating whether or not it's currently valid. Shoving nextOffset in > there and hoping that's good enough seems like a bad idea to me. > > I think we should modify the API for find_multixact_start. Let's have > it return a Boolean and return oldestOffset via an out parameter. If > !InRecovery, it will always return true and set the out parameter; but > if in recovery, it is allowed to return false without setting the out > parameter. Both values can get stored in MultiXactState, and we can > adjust the logic elsewhere to disregard oldestOffset when the > accompanying flag is false. Sounds good. I think I prefer that multixact creation is rejected altogether if the new flag is false. Is that what you mean when you say "adjust the logic"? > This still leaves open an ugly possibility: can we reach normal > running without a valid oldestOffset? If so, until the next > checkpoint happens, autovacuum has no clue whether it needs to worry. > There's got to be a fix for that, but it escapes me at the moment. I think the fix to that issue is to set the oldest offset on TrimMultiXact. That way, once WAL replay finished we're certain that we have a valid oldest offset to create new multixacts with. I'm also wondering whether the call to DetermineSafeOldestOffset on StartupMultiXact is good. At that point, we haven't replayed any WAL yet, so the oldest multi might be pointing at a file that has already been removed -- again considering the pg_basebackup scenario where the multixact files are copied much later than pg_control, so the checkpoint to replay is old but the pg_multixact contents have already been truncated in the master and are copied truncated. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general