Joe Maldonado and I have a vexing problem with PostgreSQL 7.4.5 (we are
using the PGDG RPMs on a RedHat 9 system). He posted about this briefly
with the subject "info on strange error messages on postgresql" on this
list, but I'll start from the beginning. About once per week the
database enters some pathological state where the parent postmaster --
NOT one of the child connections -- appears to run out of memory. When
that happens it dumps megabytes of diagnostic data into the log, data
that I don't know how to analyze. Then it does it all over again. And
again and again and... well, eventually the log grows until it fills as
many gigabytes as the disk has available.
Here is an example from the logfile:
TopMemoryContext: 87552048 total in 10683 blocks; 181280 free (60 chunks); 87370768 used
TopTransactionContext: 57344 total in 3 blocks; 648 free (5 chunks); 56696 used
DeferredTriggerXact: 0 total in 0 blocks; 0 free (0 chunks); 0 used
SPI Plan: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
SPI Plan: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
SPI Plan: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
MessageContext: 8192 total in 1 blocks; 7248 free (1 chunks); 944 used
PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
PortalHeapMemory: 3072 total in 2 blocks; 1136 free (12 chunks); 1936 used
CacheMemoryContext: 421519360 total in 60 blocks; 830096 free (457 chunks); 420689264 used
current_hostpairs2_idx: 1024 total in 1 blocks; 640 free (0 chunks); 384 used
current_hostpair2_eid_key: 2048 total in 1 blocks; 704 free (0 chunks); 1344 used
pg_temp_260223: 2048 total in 1 blocks; 768 free (0 chunks); 1280 used
pg_temp_260225: 1024 total in 1 blocks; 640 free (0 chunks); 384 used
pg_temp_260226: 1024 total in 1 blocks; 640 free (0 chunks); 384 used
pg_toast_888405987_index: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
pg_toast_888405987_index: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
pg_temp_260196: 1024 total in 1 blocks; 640 free (0 chunks); 384 used
pg_toast_888382258_index: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
pg_toast_888382258_index: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
[...]
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
DynaHashTable: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used
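
Since the full dump is enormous, a crude text filter that sums the
"used" figures per context name makes the biggest consumers obvious.
Here is a rough Python sketch of such a summarizer; it is nothing
PostgreSQL-specific, just a regex over lines in the format shown above,
so treat it as approximate:

import re
import sys

# Crude summary of a memory-context dump: sum the "total" and "used"
# figures per context name and print the largest consumers.  The regex
# matches "<name>: N total in B blocks; F free (C chunks); U used".
LINE = re.compile(r'^\s*(.+?):\s+(\d+) total in \d+ blocks?;'
                  r'\s*(\d+) free \(\d+ chunks?\);\s*(\d+) used')

totals = {}
for line in sys.stdin:
    m = LINE.match(line)
    if m is None:
        continue
    name = m.group(1)
    total = int(m.group(2))
    used = int(m.group(4))
    prev_total, prev_used = totals.get(name, (0, 0))
    totals[name] = (prev_total + total, prev_used + used)

# Largest contexts first, by bytes used.
for name, (total, used) in sorted(totals.items(),
                                  key=lambda item: item[1][1],
                                  reverse=True)[:15]:
    print("%12d used %12d total  %s" % (used, total, name))

Fed the dump above, it puts CacheMemoryContext at the top with roughly
420 MB used, far ahead of everything else.
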
Eventually the following is reported in the log:
[1587] ERROR: out of memory
DETAIL: Failed on request of size 268.
A quick look at the code shows that this comes from the AllocSetAlloc()
routine at line 502 of src/backend/utils/mmgr/aset.c in the 7.4.5
release. That doesn't tell me much beyond the fact that malloc() failed.
I see no evidence in /var/log/messages of system-wide memory exhaustion,
and periodic sampling of the available swap space shows that there is
always at least a gigabyte free. This suggests that a postmaster process
has simply run out of address space somehow.
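
One way to test the address-space theory is to sample the process's
VmSize from /proc/<pid>/status over time and watch whether it climbs.
A rough Python sketch, where the PID argument and the 60-second
interval are just placeholders:

import sys
import time

# Rough sketch: periodically record a process's address-space usage
# from /proc/<pid>/status (Linux).  PID and interval are placeholders.
pid = int(sys.argv[1])          # e.g. 1587
interval = 60                   # seconds between samples

while True:
    try:
        f = open("/proc/%d/status" % pid)
    except IOError:
        sys.stdout.write("process %d has gone away\n" % pid)
        break
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for line in f:
        # VmSize is the total address space; VmRSS the resident portion.
        if line.startswith("VmSize:") or line.startswith("VmRSS:"):
            sys.stdout.write("%s %s" % (stamp, line))
    f.close()
    sys.stdout.flush()
    time.sleep(interval)
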
In each case the process identifier is the same (1587 in this instance).
Other processes are unable to establish new connections, but the log
shows that existing connections continue to issue queries. This leads
us to believe that it is the parent process which is running out of
memory rather than any child forked for a specific connection. Is this
possible? What might lead to such a case?
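
As a sanity check on the PID, the first line of postmaster.pid in the
data directory holds the postmaster's own PID, so it can be compared
against the 1587 in the log. A trivial Python sketch; the default path
below is only a placeholder for wherever PGDATA actually lives:

import sys

# The first line of $PGDATA/postmaster.pid is the postmaster's PID.
if len(sys.argv) > 1:
    pgdata = sys.argv[1]
else:
    pgdata = "/var/lib/pgsql/data"   # placeholder default

log_pid = 1587                       # the PID seen in the log messages

f = open(pgdata + "/postmaster.pid")
postmaster_pid = int(f.readline().strip())
f.close()

if postmaster_pid == log_pid:
    print("PID %d is the parent postmaster" % log_pid)
else:
    print("PID %d is not the postmaster (postmaster is PID %d)"
          % (log_pid, postmaster_pid))
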
There is some history here. We originally discovered that the database
would develop long-term problems because the system tables weren't being
vacuumed properly, so we decided to deploy pg_autovacuum. This didn't
work out because we have temporary tables that get deleted periodically.
Every so often pg_autovacuum would notice that a table had gone missing
and throw up its hands in despair. We attempted to address this by
implementing a wrapper that restarts it whenever it terminates.
After this change the problem described above began. We have since
added a delay to the wrapper that doubles in duration each time
pg_autovacuum crashes, to avoid hammering the system when something is
wrong. That might make the problem less catastrophic, but of course it
won't really solve it. After all, the memory problem is in the parent
postmaster, not in pg_autovacuum.
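
In outline, the wrapper logic amounts to something like the following
sketch (not our actual script; the pg_autovacuum command line is a
placeholder, since the real flags depend on the installation):

import subprocess
import time

# Sketch of a restart wrapper: rerun pg_autovacuum whenever it exits,
# doubling the delay before each restart.  The command line below is a
# placeholder for the real invocation.
CMD = ["pg_autovacuum"]
delay = 60                       # initial back-off in seconds

while True:
    status = subprocess.call(CMD)
    print("pg_autovacuum exited with status %d; restarting in %d seconds"
          % (status, delay))
    time.sleep(delay)
    delay *= 2                   # double the back-off after every exit
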
Why should ordinary vacuuming run the system out of memory? What should
we do to make things work reliably?
Jeff