Re: How to repair Pg 9.1?

Tomas Vondra <tv@xxxxxxxx> · Sun, 15 Apr 2012 15:35:28 +0200

On 15.4.2012 14:52, Dmitry E. Oboukhov wrote:
> Recently I started a hard query (ALTER TABLE) and then Postgresql was
> killed by OOM killer.
> 
> Since that clients lose their connection and Postgresql says in log:
> 
>     2012-04-15 16:46:02 MSK LOG:  server process (PID 825) was terminated by signal 11: Segmentation fault
>     2012-04-15 16:46:02 MSK LOG:  terminating any other active server processes
>     2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
>     2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>     2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your command.
>     2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
>     2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>     2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your command.
>     2012-04-15 16:46:02 MSK WARNING:  terminating connection because of crash of another server process
>     2012-04-15 16:46:02 MSK DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>     2012-04-15 16:46:02 MSK HINT:  In a moment you should be able to reconnect to the database and repeat your command.
>     2012-04-15 16:46:02 MSK LOG:  all server processes terminated; reinitializing
>     2012-04-15 16:46:02 MSK LOG:  database system was interrupted; last known up at 2012-04-15 16:45:17 MSK
>     2012-04-15 16:46:02 MSK LOG:  database system was not properly shut down; automatic recovery in progress
>     2012-04-15 16:46:02 MSK LOG:  consistent recovery state reached at 58/21C44D18
>     2012-04-15 16:46:02 MSK LOG:  redo starts at 58/21C44D18
>     2012-04-15 16:46:02 MSK LOG:  record with zero length at 58/21CA7E20
>     2012-04-15 16:46:02 MSK LOG:  redo done at 58/21CA7DE0
>     2012-04-15 16:46:02 MSK LOG:  last completed transaction was at log time 2012-04-15 16:46:02.620188+04
>     2012-04-15 16:46:02 MSK LOG:  autovacuum launcher started
>     2012-04-15 16:46:02 MSK LOG:  database system is ready to accept connections
> 
> 
> What can I do in the situation?

Well, what happened is quite clear. One of the processes was terminated
by the OOM killer. The database noticed this and responded properly by
restarting the database. That caused a proper recovery - you don't need
to "repair" the database.

What you need to do is change the machine configuration so that the OOM
errors do not happen. I don't know what's running on the machine, but
this usually means tweaking the vm.* parameters at the kernel level and
shared_buffers/work_mem/maintenance_work_mem at the PostgreSQL level.

In this particular situation (ALTER TABLE) I'd use a low value for the
maintenance_work_mem parameter - it'll run longer but won't need that
much memory.

Tomas

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general