Hi,

I see the following error in pg.log:

UTC4115FATAL: the database system is in recovery mode

That message was actually logged repeatedly for about 4 hours according to the logs (I don't have access to the system itself, just the logs).

Leading up to that error, pg.log contains the following:
2011-03-28 10:44:06 UTC3609LOG: checkpoints are occurring too frequently (11 seconds apart)
2011-03-28 10:44:06 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:18 UTC3609LOG: checkpoints are occurring too frequently (12 seconds apart)
2011-03-28 10:44:18 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:28 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement timeout
2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, it was already committed
2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated by signal 6
2011-03-28 10:44:42 UTC3566LOG: terminating any other active server processes
2011-03-28 10:44:42 UTC13142WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC13142DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC13142HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC29834WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC29834DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC29834HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC3553WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC3553DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
In fact, those last three lines repeat over and over until "UTC4115FATAL: the database system is in recovery mode" starts being logged, which then continues for about 4 hours. At that point, roughly 4 hours later, the system appears to recover.
The checkpoint settings in postgresql.conf are commented out, so I assume the defaults are being used:
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_warning = 30s # 0 is off
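For what it's worth, if I were to follow the hint in the log and uncomment/raise those values, I imagine the change would look roughly like the following (the numbers are just a guess on my part for illustration, not something I've tested on this system):

checkpoint_segments = 16 # in logfile segments, min 1, 16MB each; raise to space checkpoints out
checkpoint_timeout = 5min # range 30s-1h
checkpoint_warning = 30s # 0 is off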
The system where this was seen was running pgsql-8.2.6 on RHEL4.
I'm not sure if this is a known bug (or whether it's a bug at all, or something I can address with different configuration), but I thought I would post here first in case anyone is familiar with this issue and can suggest a possible solution. Any ideas?
Cheers,
Matt