Re: UTC4115FATAL: the database system is in recovery mode

Mathew Samuel <Mathew.Samuel@xxxxxxxxxxx> · Tue, 31 May 2011 10:46:30 -0400

Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> On 05/30/2011 10:29 PM, Mathew Samuel wrote:
> 
>> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration 
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too 
>> frequently (10 seconds apart)
>> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration 
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement 
>> timeout
>> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze 
>> _zamboni.sl_log_1
>> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, 
>> it was already committed
>> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was 
>> terminated by signal 6
>
> Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by statement_timeout, couldn't be aborted (assuming it was in fact
> 1827110275) and then the backend crashed with a signal 6 (SIGABRT). 
> SIGABRT can be caused by an assertion failure, certain fatal aborts in the C library caused by memory allocation errors, etc.
>
> Alas, while PostgreSQL may have dumped a core file I doubt there's any debug information in your build. If you do find a core file for that process ID, it might be worth checking for a debuginfo rpm just in case.
>
>> In fact those last 3 lines are repeated over and over again repeatedly 
>> until "UTC4115FATAL: the database system is in recovery mode" is 
>> logged for 4 hours. At some point, 4 hours later of course, it appears 
>> that the system recovers.
>
> Wow. Four hours recovery with default checkpoint settings.
>
> Is it possible that the server was completely overloaded and was swapping heavily? That could explain why VACUUM timed out in the first place, and would explain why it took such a long time to recover. Check your system logs around the same time for other indications of excessive load, and check your monitoring history if you have monitoring like Cacti or the like active.
>
> See if there's anything interesting in the kernel logs too.
>
> Just for completeness, can you send all non-commented-out, non-blank lines in your postgresql.conf ?
>
> $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1
>
> --
> Craig Ringer

Thanks for your help, here are the results from running that you provided to me:
$ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1

listen_addresses='localhost'
port=5432
max_connections=200
ssl=off
shared_buffers = 24MB
max_fsm_pages = 153600
log_line_prefix='%t%p'
statement_timeout=60000
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

And as you anticipated, no core file was dumped on that system.

It is quite possible that the system was under heavy load at the time as it is a heavily used system that would be prone to periods of strong use. No monitoring (Cacti or otherwise) was active as far as I can gather.

Not that this is any sort of solution but would one approach be to increase the statement_timeout to greater than 1 minute? I realize that this may hide the symptom but if the customer does hit this issue again I'm just seeing if there is an option I can change to help prevent a reoccurrence.

Cheers,
Matt

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general