On Thu, 2007-05-31 at 10:23 -0400, Tom Lane wrote:
> Frank Wittig <fw@xxxxxxxxxxxx> writes:
> > The problem is that the slave server stops checkpointing after some
> > hours of working (about 24 to 48 hours of continued log replay).
>
> Hm ... look at RecoveryRestartPoint() in xlog.c.  Could there be
> something wrong with this logic?
>
>     /*
>      * Do nothing if the elapsed time since the last restartpoint is less than
>      * half of checkpoint_timeout.  (We use a value less than
>      * checkpoint_timeout so that variations in the timing of checkpoints on
>      * the master, or speed of transmission of WAL segments to a slave, won't
>      * make the slave skip a restartpoint once it's synced with the master.)
>      * Checking true elapsed time keeps us from doing restartpoints too often
>      * while rapidly scanning large amounts of WAL.
>      */
>     elapsed_secs = time(NULL) - ControlFile->time;
>     if (elapsed_secs < CheckPointTimeout / 2)
>         return;
>
> The idea is that the slave (once in sync with the master) ought to
> checkpoint every time it sees a checkpoint record in the master's
> output.  I'm not seeing a flaw but maybe there is one here, or somewhere
> nearby.  Are you sure the master is checkpointing?

Hmmm. This can happen if a backend crashes while half-way through any
set of changes that causes safe_restartpoint() to be true.

Or it might be that one of the index AMs doesn't correctly clear the
multi-WAL actions in some corner cases.

Or it could be that the mdsync looping problem has been worse than we
thought and checkpoints have been avoided completely for some time.

Frank,

This is repeatable, yes? Has anything crashed on your server? Are you
using GIN or GiST indexes?

I'll look at putting some debug information in there that logs whether
multi-WAL actions remain unresolved for any length of time.

Continuing to think about this one....

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
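
For anyone who wants to poke at the timing check quoted above in isolation,
here is a minimal standalone sketch. It is not the xlog.c code: the function
name restartpoint_too_soon() and its parameters are hypothetical, standing in
for time(NULL), ControlFile->time and CheckPointTimeout. It only mirrors the
"less than half of checkpoint_timeout" comparison, assuming all values are in
seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper (not part of PostgreSQL): returns true when a
 * restartpoint would be skipped by the elapsed-time rule, i.e. when less
 * than half of checkpoint_timeout has passed since the last one.
 *
 * now                      stands in for time(NULL)
 * last_restartpoint        stands in for ControlFile->time
 * checkpoint_timeout_secs  stands in for CheckPointTimeout
 */
bool
restartpoint_too_soon(time_t now, time_t last_restartpoint,
                      int checkpoint_timeout_secs)
{
    double elapsed_secs;

    assert(checkpoint_timeout_secs > 0);

    /* A negative result (recorded time in the future) also counts as too soon. */
    elapsed_secs = difftime(now, last_restartpoint);

    /* Integer division, as in the quoted snippet. */
    return elapsed_secs < checkpoint_timeout_secs / 2;
}
```

With a timeout of 300 seconds, for example, the rule skips a restartpoint
100 seconds after the previous one but allows it at 150 seconds. Note that
this covers only the timing rule; it says nothing about the per-resource-manager
safety checks discussed above.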