vacuum killed because of out of memory

Geoffrey <lists@xxxxxxxxxxxxxxxxxxxxx> · Mon, 27 Aug 2007 10:13:10 -0400

I've reviewed the various recent threads on out-of-memory problems, and 
we had a similar issue last night.  We have 11 postmasters running on 
two machines in a cluster environment: five on one, six on the other.  
They've been running this way for a little over a year.

Configuration:
Quad dual-core Opterons
8 GB memory
Red Hat Advanced Server 4

relevant postgresql.conf settings:

tcpip_socket = true
max_connections = 35
shared_buffers = 16000
checkpoint_segments = 10
log_min_error_statement = warning
log_connections = true
log_pid = true
log_timestamp = true
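
For scale, here is a rough sketch of what shared_buffers = 16000 amounts to. It assumes the default 8 kB block size and that every postmaster uses the same settings; neither is stated above. It covers shared buffers only, not the per-process memory a vacuum uses.

```python
# Rough estimate of shared-buffer memory per machine.
# Assumptions (not stated in the post): the default 8 kB block size, and
# the same shared_buffers value on every postmaster.
BLOCK_SIZE_BYTES = 8192
SHARED_BUFFERS = 16000  # from the postgresql.conf settings, counted in blocks


def shared_buffer_mib(buffers: int, postmasters: int) -> float:
    """Return total shared-buffer memory in MiB for `postmasters` instances."""
    return buffers * BLOCK_SIZE_BYTES * postmasters / (1024 * 1024)


per_instance = shared_buffer_mib(SHARED_BUFFERS, 1)
busiest_machine = shared_buffer_mib(SHARED_BUFFERS, 6)  # six postmasters on one box

print(f"per postmaster:  {per_instance:.0f} MiB")
print(f"six postmasters: {busiest_machine:.0f} MiB")
```

Under those assumptions the shared buffers take well under 1 GB of the 8 GB on the busier machine, so they would not by themselves explain the kill.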

We run a 'vacuum full analyze' once a week (and I've seen a thread that 
says this should not be necessary).

Just the same, last night, while running a nightly 'vacuum full' process 
for our largest database (7.5G base), the vacuum process was killed by 
the OS because of out of memory issues.

Aug 27 00:59:07 gan-lxc-01 kernel: Out of Memory: Killed process 26169 (postmaster).

The process 26169 does appear to correspond to the vacuum process and 
not the database postmaster process.  The postmaster process did not 
die.  We did see the following in the database log:

2007-08-27 00:59:07 [13586] LOG:  server process (PID 26169) was terminated by signal 9
2007-08-27 00:59:07 [13586] LOG:  terminating any other active server processes
2007-08-27 00:59:07 [7790] WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2007-08-27 00:59:07 [13586] LOG:  all server processes terminated; reinitializing
2007-08-27 00:59:07 [2031] LOG:  database system was interrupted at 2007-08-27 00:58:59 EDT
2007-08-27 00:59:07 [2031] LOG:  checkpoint record is at 18/B3DF3B94
2007-08-27 00:59:07 [2031] LOG:  redo record is at 18/B3DF3B94; undo record is at 0/0; shutdown FALSE
2007-08-27 00:59:07 [2031] LOG:  next transaction ID: 63340557; next OID: 6459085
2007-08-27 00:59:07 [2031] LOG:  database system was not properly shut down; automatic recovery in progress
2007-08-27 00:59:07 [2031] LOG:  redo starts at 18/B3DF3BD4
2007-08-27 00:59:08 [2033] LOG:  connection received: host=198.212.166.38 port=33787
2007-08-27 00:59:08 [2033] FATAL:  the database system is starting up
2007-08-27 00:59:11 [2035] LOG:  connection received: host=XXX.XXX.XXX.XXX port=33788
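
The match between the kernel message and the server log can be checked mechanically. The sketch below uses only the two log lines quoted in this message, with the wrapped lines joined. It assumes the bracketed number in the server log is the PID of the process writing the entry (log_pid = true is set above); the regular expressions are illustrative, not a general log parser.

```python
import re

# Log lines quoted from this message (wrapped lines joined).
kernel_line = (
    "Aug 27 00:59:07 gan-lxc-01 kernel: Out of Memory: "
    "Killed process 26169 (postmaster)."
)
server_line = (
    "2007-08-27 00:59:07 [13586] LOG:  server process (PID 26169) "
    "was terminated by signal 9"
)

killed = re.search(r"Killed process (\d+)", kernel_line)
terminated = re.search(
    r"\[(\d+)\] LOG:\s+server process \(PID (\d+)\) "
    r"was terminated by signal (\d+)",
    server_line,
)

oom_pid = int(killed.group(1))
# Assumption: the bracketed value is the PID of the logging process.
reporter_pid, backend_pid, signal_no = (int(g) for g in terminated.groups())

print("kernel killed the reported backend:", oom_pid == backend_pid)
print("logging process is a different PID:", reporter_pid != backend_pid)
print("signal:", signal_no)
```

Both lines name PID 26169, and the entry was written by a different process (13586), which fits the observation that the postmaster itself survived.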

So, my question is, based on the configuration of this box and the 
configuration of postgresql, can anyone point to anything that might 
cause this to happen?

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
 - Benjamin Franklin
