Search Postgresql Archives

Out of memory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I have a client that experienced several Out Of Memory errors a few weeks ago (March 10 & 11), and I'd like to figure out the cause. In the logs it's showing that they were getting out of memory errors for about 0.5-1 hour, after which one of the processes would crash and take the whole database down. After they restarted the server it would eventually start giving out of memory messages and crash again. This happened a total of five times over a 24 hour period. After that we did not see these errors again. They did upgrade to 8.1.11 on the 14th, and have also moved some of the databases to different servers afterwards.

First some background information:

Software (at the time of the memory errors): CentOS 4.5 (x86_64) running its 2.6.9-55.ELsmp Linux kernel, PostgreSQL 8.1.9 (from RPMs provided on the PostgreSQL web site: postgresql-8.1.9-1PGDG.x86_64).

Hardware: 4 dual-core Opterons.  16GB physical RAM, 2GB swap.

Database: they use persistent connections, and usually have around 1000 open database connections. The vast majority of those are usually idle. They do run a lot of queries though. The total size of the databases in this cluster is 36GB, with the largest database being 21GB, and the largest table being 2.5GB (having 20 million tuples).

Highlights of postgresql.conf settings:
max_connections = 2000
shared_buffers = 120000
work_mem = 4096
maintenance_work_mem = 268435
max_fsm_pages = 1500000
max_fsm_relations = 2000
checkpoint_segments = 8
effective_cache_size = 1400000
stats_start_collector = on
stats_command_string = off
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = off
autovacuum = on
lc_* = 'en_US.UTF-8'
Note that not all of these settings may have been in effect during the first crash since the server hadn't been restarted for quite some time. However, during the second and third crash these settings were in effect.

ulimit settings:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 137216
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Miscellaneous facts:
Before the crashes there were a lot of "duplicate key violates unique constraint ..." errors as well as some syntax errors in some of the queries. A lot of these errors were fixed after the crashes. I don't think these would be related to the problem unless there's a memory leak somewhere.

This server does not have any other processes running that could consume a significant amount of memory. Some of the other processes that run are SSH, cupsd, snmpd, sendmail, portsentry, xinetd, and Dell OpenManage. This server has no cronjobs.

These errors usually seemed to occur right after autovacuum started vacuuming a database, although I'm not sure if autovacuum is the cause here.

Here are some of the relevant portions from the logs (some of the database/relation names are obfuscated to protect confidentiality): 2008-03-10 18:42:58 EDT LOG: autovacuum: processing database "database1" [this database is 4GB in size] TopMemoryContext: 401632 total in 11 blocks; 12832 free (12 chunks); 388800 used TopTransactionContext: 8192 total in 1 blocks; 7352 free (11 chunks); 840 used Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used Autovacuum context: 253952 total in 5 blocks; 131152 free (147 chunks); 122800 used smgr relation table: 24576 total in 2 blocks; 15968 free (4 chunks); 8608 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 24576 total in 2 blocks; 7632 free (3 chunks); 16944 used
CacheMemoryContext: 2157568 total in 10 blocks; 941176 free (0 chunks); 1216392 used some_pkey: 1024 total in 1 blocks; 312 free (0 chunks); 712 used [note: this index is currently about 41MB but only has 129 small tuples - it's currently very bloated] pg_index_indrelid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_constraint_conrelid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_autovacuum_vacrelid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_database_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used
pg_type_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_statistic_relid_att_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_auth_members_member_role_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_auth_members_role_member_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 208 free (0 chunks); 816 used
pg_proc_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_operator_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_opclass_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 208 free (0 chunks); 816 used pg_namespace_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_namespace_nspname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_language_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_language_name_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_inherits_relid_seqno_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_index_indexrelid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_authid_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_authid_rolname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_database_datname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_conversion_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_conversion_name_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used pg_class_relname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used
pg_class_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_cast_source_target_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_amproc_opc_proc_index: 1024 total in 1 blocks; 208 free (0 chunks); 816 used
pg_amop_opr_opc_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used
pg_amop_opc_strat_index: 1024 total in 1 blocks; 208 free (0 chunks); 816 used pg_aggregate_fnoid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used Per-database table: 57344 total in 3 blocks; 26096 free (10 chunks); 31248 used Per-database table: 57344 total in 3 blocks; 34320 free (10 chunks); 23024 used Per-database table: 24576 total in 2 blocks; 5696 free (4 chunks); 18880 used Per-database table: 122880 total in 4 blocks; 9344 free (13 chunks); 113536 used Per-database table: 122880 total in 4 blocks; 21680 free (13 chunks); 101200 used Per-database table: 57344 total in 3 blocks; 34320 free (10 chunks); 23024 used Per-database table: 24576 total in 2 blocks; 5696 free (4 chunks); 18880 used Per-database table: 122880 total in 4 blocks; 23744 free (14 chunks); 99136 used Per-database table: 122880 total in 4 blocks; 21680 free (13 chunks); 101200 used Per-database table: 24576 total in 2 blocks; 13920 free (4 chunks); 10656 used Per-database table: 122880 total in 4 blocks; 21680 free (13 chunks); 101200 used
Databases hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
MdSmgr: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
LockTable (locallock hash): 24576 total in 2 blocks; 15984 free (5 chunks); 8592 used
Timezones: 52560 total in 2 blocks; 3744 free (0 chunks); 48816 used
Postmaster: 2328624 total in 3 blocks; 18048 free (86 chunks); 2310576 used
ErrorContext: 8192 total in 1 blocks; 8160 free (3 chunks); 32 used
2008-03-10 18:42:58 EDT ERROR:  out of memory
2008-03-10 18:42:58 EDT DETAIL:  Failed on request of size 274877436.
TopMemoryContext: 458976 total in 18 blocks; 12528 free (53 chunks); 446448 used TopTransactionContext: 8192 total in 1 blocks; 7648 free (0 chunks); 544 used
SPI Plan: 15360 total in 4 blocks; 584 free (0 chunks); 14776 used
SPI Plan: 7168 total in 3 blocks; 952 free (0 chunks); 6216 used
SPI Plan: 3072 total in 2 blocks; 1112 free (0 chunks); 1960 used
SPI Plan: 3072 total in 2 blocks; 1112 free (0 chunks); 1960 used
SPI Plan: 7168 total in 3 blocks; 952 free (0 chunks); 6216 used
SPI Plan: 3072 total in 2 blocks; 1576 free (0 chunks); 1496 used
SPI Plan: 3072 total in 2 blocks; 1576 free (0 chunks); 1496 used
SPI Plan: 3072 total in 2 blocks; 1256 free (0 chunks); 1816 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 1024 total in 1 blocks; 72 free (0 chunks); 952 used
PL/PgSQL function context: 24576 total in 2 blocks; 15096 free (4 chunks); 9480 used
SPI Plan: 15360 total in 4 blocks; 776 free (0 chunks); 14584 used
SPI Plan: 3072 total in 2 blocks; 1112 free (0 chunks); 1960 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
SPI Plan: 3072 total in 2 blocks; 1192 free (0 chunks); 1880 used
PL/PgSQL function context: 24576 total in 2 blocks; 1024 free (4 chunks); 23552 used
SPI Plan: 3072 total in 2 blocks; 1448 free (0 chunks); 1624 used
SPI Plan: 7168 total in 3 blocks; 192 free (0 chunks); 6976 used
PL/PgSQL function context: 8192 total in 1 blocks; 6016 free (3 chunks); 2176 used
SPI Plan: 3072 total in 2 blocks; 1064 free (0 chunks); 2008 used
SPI Plan: 3072 total in 2 blocks; 1208 free (0 chunks); 1864 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1256 free (0 chunks); 1816 used
SPI Plan: 3072 total in 2 blocks; 1688 free (0 chunks); 1384 used
SPI Plan: 7168 total in 3 blocks; 1952 free (0 chunks); 5216 used
SPI Plan: 3072 total in 2 blocks; 1064 free (0 chunks); 2008 used
SPI Plan: 3072 total in 2 blocks; 1208 free (0 chunks); 1864 used
SPI Plan: 3072 total in 2 blocks; 1208 free (0 chunks); 1864 used
SPI Plan: 3072 total in 2 blocks; 1664 free (0 chunks); 1408 used
PL/PgSQL function context: 24576 total in 2 blocks; 13080 free (8 chunks); 11496 used
SPI Plan: 7168 total in 3 blocks; 4056 free (0 chunks); 3112 used
SPI Plan: 3072 total in 2 blocks; 952 free (0 chunks); 2120 used
SPI Plan: 3072 total in 2 blocks; 936 free (0 chunks); 2136 used
SPI Plan: 3072 total in 2 blocks; 952 free (0 chunks); 2120 used
SPI Plan: 3072 total in 2 blocks; 952 free (0 chunks); 2120 used
SPI Plan: 3072 total in 2 blocks; 952 free (0 chunks); 2120 used
SPI Plan: 7168 total in 3 blocks; 3864 free (0 chunks); 3304 used
SPI Plan: 3072 total in 2 blocks; 936 free (0 chunks); 2136 used
PL/PgSQL function context: 8192 total in 1 blocks; 896 free (4 chunks); 7296 used Type information cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
SPI Plan: 3072 total in 2 blocks; 352 free (0 chunks); 2720 used
PL/PgSQL function context: 8192 total in 1 blocks; 3552 free (6 chunks); 4640 used PLpgSQL function cache: 24224 total in 2 blocks; 3744 free (0 chunks); 20480 used Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used MessageContext: 1063832 total in 8 blocks; 19088 free (6 chunks); 1044744 used
CFuncHash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
smgr relation table: 24576 total in 2 blocks; 3584 free (4 chunks); 20992 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 8192 total in 1 blocks; 8160 free (1 chunks); 32 used
Relcache by OID: 24576 total in 2 blocks; 8672 free (3 chunks); 15904 used
CacheMemoryContext: 4267376 total in 11 blocks; 1835032 free (1 chunks); 2432344 used [139 user indexes skipped - the biggest item was:187392 total in 25 blocks; 8024 free (0 chunks); 179368 used] pg_index_indrelid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_constraint_conrelid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_type_typname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used
pg_type_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_statistic_relid_att_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_auth_members_member_role_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_auth_members_role_member_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 208 free (0 chunks); 816 used
pg_proc_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_operator_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_opclass_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 208 free (0 chunks); 816 used pg_namespace_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_namespace_nspname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_language_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_language_name_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_inherits_relid_seqno_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_index_indexrelid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_authid_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_authid_rolname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_database_datname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_conversion_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used pg_conversion_name_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used pg_class_relname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used
pg_class_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_cast_source_target_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used pg_amproc_opc_proc_index: 1024 total in 1 blocks; 208 free (0 chunks); 816 used
pg_amop_opr_opc_index: 1024 total in 1 blocks; 216 free (0 chunks); 808 used
pg_amop_opc_strat_index: 1024 total in 1 blocks; 208 free (0 chunks); 816 used pg_aggregate_fnoid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
MdSmgr: 8192 total in 1 blocks; 2208 free (0 chunks); 5984 used
LockTable (locallock hash): 24576 total in 2 blocks; 15984 free (5 chunks); 8592 used
Timezones: 52560 total in 2 blocks; 3744 free (0 chunks); 48816 used
ErrorContext: 8192 total in 1 blocks; 8160 free (3 chunks); 32 used
2008-03-10 18:43:30 EDT ERROR:  out of memory
2008-03-10 18:43:30 EDT DETAIL:  Failed on request of size 16.
[A bunch more memory errors skipped]
2008-03-10 18:43:33 EDT LOG: could not fork new process for connection: Cannot allocate memory 2008-03-10 18:43:33 EDT LOG: could not fork new process for connection: Cannot allocate memory 2008-03-10 18:43:33 EDT LOG: could not fork new process for connection: Cannot allocate memory 2008-03-10 18:43:33 EDT LOG: could not fork new process for connection: Cannot allocate memory
[...]
2008-03-10 18:51:08 EDT LOG:  archive command "..." failed: return code -1
[...]
2008-03-10 19:20:39 EDT ERROR:  out of memory
2008-03-10 19:20:39 EDT DETAIL:  Failed on request of size 32.
2008-03-10 19:20:39 EDT LOG: server process (PID 5609) was terminated by signal 11
2008-03-10 19:20:39 EDT LOG:  terminating any other active server processes
2008-03-10 19:20:39 EDT WARNING: terminating connection because of crash of another server process 2008-03-10 19:20:39 EDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-03-10 19:20:39 EDT HINT: In a moment you should be able to reconnect to the database and repeat your command.

I also saw the following lines in the kernel logs during this 24 hour period: Mar 10 19:20:39 servername kernel: postmaster[5609]: segfault at 0000007fbffedfe8 rip 000000000049ea35 rsp 0000007fbffedff0 error 6 Mar 10 20:44:15 servername kernel: postmaster[26808]: segfault at 0000007fbffeafe8 rip 000000000049ea00 rsp 0000007fbffeb018 error 6 Mar 11 10:10:59 servername kernel: postmaster[19984]: segfault at fffffffffffffff0 rip 00000000005e7b13 rsp 0000007fbfffe178 error 4 Mar 11 11:52:07 servername kernel: postmaster[27309]: segfault at 0000007fbffedfe8 rip 000000000049ea00 rsp 0000007fbffee018 error 6 Mar 11 12:39:18 servername kernel: postmaster[30301]: segfault at fffffffffffffff0 rip 00000000005e7b13 rsp 0000007fbfffe158 error 4

Each of these five segfaults correspond to when the database crashed.

vm.overcommit_memory was set to 0, but I do not see any memory errors in the kernel logs so I don't believe the OOM killer was used.

Current output of free:
            total       used       free     shared    buffers     cached
Mem:      16479276   16441000      38276          0      26612    8633316
-/+ buffers/cache:    7781072    8698204
Swap:      2031608       1180    2030428

The client is using Zenoss to monitor the servers, and it always shows that free memory is pretty low, which tells me that this setup doesn't differentiate memory that's used by the OS buffers/cache, which is unfortunate. I did see free memory going up to about 1GB around each time PostgreSQL crashed, presumably because PostgreSQL got killed and released its memory. Free swap remained at 2GB free before, during, and after the crashes. I did not see any abnormal changes in the loads and CPU utilization during the crashes. Note that these measurements were taking once every other hour so it's not very granular.

Any ideas as to what could have caused this?

Thank you,

Alex


--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux