Hi,

I see the following error in pg.log:

UTC4115FATAL: the database system is in recovery mode

That message was actually logged repeatedly for about 4 hours according to the logs (I don't have access to the system itself, just the logs).

Leading up to that error, pg.log contains the following:
2011-03-28 10:44:06 UTC3609LOG: checkpoints are occurring too frequently (11 seconds apart)
2011-03-28 10:44:06 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:18 UTC3609LOG: checkpoints are occurring too frequently (12 seconds apart)
2011-03-28 10:44:18 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:28 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement timeout
2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, it was already committed
2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated by signal 6
2011-03-28 10:44:42 UTC3566LOG: terminating any other active server processes
2011-03-28 10:44:42 UTC13142WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC13142DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC13142HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC29834WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC29834DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC29834HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC3553WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC3553DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
In fact, those last three lines repeat over and over until "UTC4115FATAL: the database system is in recovery mode" starts being logged, which then continues for about 4 hours. At that point, roughly 4 hours later, the system appears to recover.
The checkpoint settings in postgresql.conf are commented out, so I assume the defaults are being used:
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_warning = 30s # 0 is off
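For what it's worth, if I were to follow the hint in the log and uncomment/raise those values, I imagine the change would look roughly like the following (the numbers are just a guess on my part for illustration, not something I've tested on this system):

checkpoint_segments = 16 # in logfile segments, min 1, 16MB each; raise to space checkpoints out
checkpoint_timeout = 5min # range 30s-1h
checkpoint_warning = 30s # 0 is off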
The system where this was seen was running pgsql-8.2.6 on RHEL4.
I'm not sure if this is a known bug (or whether it's a bug at all, or something I can address with different configuration), but I thought I would post here first in case anyone is familiar with this issue and can suggest a possible solution. Any ideas?
Cheers,
Matt