Re: warm standby server stops doing checkpoints afterawhile

"Simon Riggs" <simon@xxxxxxxxxxxxxxx> · Fri, 01 Jun 2007 12:44:31 +0100

On Fri, 2007-06-01 at 13:33 +0200, Frank Wittig wrote:
> Simon Riggs schrieb:
> 
> > This is repeatable, yes?
> Yes, it occures every time I begin with a new base backup. And it seem
> to happen during recreation of tsearch2 vectors of large amounts of data
> sets.
> 
> > Has anything crashed on your server?
> No. Crashes didn't occur duriung that times.
> 
> > Are you using GIN or GIST indexes?
> I'm using GIN indesex on tsearch2 vectors of very large ammount of data
> sets. (About 3,8 million data sets of which about 30-50 thousands are
> recreated and indexed when the descibed behavior occures.)
> 
> > I'll look at putting some debug information in there that logs whether
> > multi-WAL actions remain unresolved for any length of time.
> Extra debug info would be great.
> I tested myself adding some debug output into the function Tom Lane
> mentioned and found that after the server stopped checkpointing every
> time the function is called it exits at this point:
> 
>   /*
>    * Is it safe to checkpoint?  We must ask each of the resource managers
>    * whether they have any partial state information that might prevent a
>    * correct restart from this point.  If so, we skip this opportunity, but
>    * return at the next checkpoint record for another try.
>    */
>   for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
>   {
>     if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
>       if (!(RmgrTable[rmid].rm_safe_restartpoint()))
>         return;
>   }
> 
> It exits every time with the same value for rmid.
> Logs look like this (The quoted lines repeat):
> 
> <2007-06-01 13:10:28.936 CEST:%> DEBUG:  00000: executing restore
> command "/var/lib/pgsql/restore.pl
> /mnt/wal_archive/00000001000000C9000000C2 pg_xlog/RECOVERYXLOG"
> <2007-06-01 13:10:28.936 CEST:%> LOCATION:  RestoreArchivedFile, xlog.c:2474
> <2007-06-01 13:11:29.055 CEST:%> LOG:  00000: restored log file
> "00000001000000C9000000C2" from archive
> <2007-06-01 13:11:29.055 CEST:%> LOCATION:  RestoreArchivedFile, xlog.c:2504
> <2007-06-01 13:11:29.364 CEST:%> DEBUG:  00000: found Checkpoint in XLOG
> <2007-06-01 13:11:29.364 CEST:%> CONTEXT:  xlog redo checkpoint: redo
> C9/C20DE050; undo 0/0; tli 1; xid 0/36130541; oid 241990328; multi 8;
> offset 15; online
> <2007-06-01 13:11:29.364 CEST:%> LOCATION:  RecoveryRestartPoint,
> xlog.c:5739
> <2007-06-01 13:11:29.365 CEST:%> DEBUG:  00000: Ressource manager (13)
> has partial state information
> <2007-06-01 13:11:29.365 CEST:%> CONTEXT:  xlog redo checkpoint: redo
> C9/C20DE050; undo 0/0; tli 1; xid 0/36130541; oid 241990328; multi 8;
> offset 15; online
> <2007-06-01 13:11:29.365 CEST:%> LOCATION:  RecoveryRestartPoint,
> xlog.c:5769

To me, this points clearly to there being an improperly completed action
in resource manager 13. (GIN) In summary, it appears that there may be
an issue with the GIN code for WAL recovery and this is effecting the
Warm Standby.

I'll reply again soon with a patch to allow data of incomplete splits to
be logged so we can see what's going on. After that we should continue
to discuss this on -hackers.

-- 
  Simon Riggs             
  EnterpriseDB   http://www.enterprisedb.com