
NFS vs. PostgreSQL on Solaris


I've got an interesting case study that I don't fully understand from a postgres perspective, and I'm hoping someone in the community can help me understand what led to this outcome and whether it's easily prevented.

The setup: PostgreSQL 8.2.3 on Solaris 10 x86

postgresql.conf:
shared_buffers = 200000
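For scale, and assuming the stock 8 kB block size (I haven't rebuilt with a different BLCKSZ), that setting works out to roughly 1.5 GB of shared memory. Back-of-the-envelope:

# rough arithmetic only; assumes the default BLCKSZ of 8192 bytes
shared_buffers = 200000            # buffers, as set in postgresql.conf
block_size = 8192                  # bytes per buffer (default build)
print(shared_buffers * block_size / 1024**3)   # ~1.5 GiB of shared memory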

Until now, this box had been dedicated entirely to postgres. A new project required it to also share a directory over NFS. In preparation, nfsd had been running for quite some time, although today was the first day that production mount points on other boxes were added. The first external mount point was created this morning, and all seemed quiet. The second external mount point was added this afternoon, and it caused postgres great consternation.

Within a few seconds of enabling the second external NFS mount point, postgres began reporting out of memory errors. An INSERT generated the following:

TopMemoryContext: 118832 total in 10 blocks; 17512 free (24 chunks); 101320 used
TopTransactionContext: 8192 total in 1 blocks; 7856 free (0 chunks); 336 used
Type information cache: 8192 total in 1 blocks; 1800 free (0 chunks); 6392 used
Operator class cache: 8192 total in 1 blocks; 4872 free (0 chunks); 3320 used
MessageContext: 262144 total in 6 blocks; 75208 free (5 chunks); 186936 used
smgr relation table: 8192 total in 1 blocks; 1776 free (0 chunks); 6416 used
TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0 chunks); 16 used
Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
PortalHeapMemory: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
ExecutorState: 147648 total in 7 blocks; 59816 free (11 chunks); 87832 used
HashTableContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
HashBatchContext: 25174100 total in 13 blocks; 7754600 free (16 chunks); 17419500 used
TupleSort: 4448280 total in 10 blocks; 1888 free (25 chunks); 4446392 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Unique: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
AggContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 8192 total in 1 blocks; 2856 free (0 chunks); 5336 used
CacheMemoryContext: 659000 total in 19 blocks; 140248 free (1 chunks); 518752 used

This was followed by a number of statements about individual relations and, eventually, this:

MdSmgr: 8192 total in 1 blocks; 7120 free (0 chunks); 1072 used
LOCALLOCK hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
Timezones: 48616 total in 2 blocks; 5968 free (0 chunks); 42648 used
ErrorContext: 8192 total in 1 blocks; 8176 free (4 chunks); 16 used
2007-04-26 14:47:42 CDT 17018 10.0.4.168(46723):ERROR:  out of memory
2007-04-26 14:47:42 CDT 17018 10.0.4.168(46723):DETAIL:  Failed on request of size 257.

This happened intermittently over the next several minutes, eventually escalating to a FATAL error:

2007-04-26 14:50:12 CDT 18421 10.0.4.168(44414):FATAL:  out of memory
2007-04-26 14:50:12 CDT 18421 10.0.4.168(44414):DETAIL:  Failed on request of size 1864.

The FATAL error was preceded by a memory context report similar to the one that accompanied the original ERRORs. I don't know what distinguishes the ERROR "out of memory" from the FATAL version.

Then, eventually, without any evidence of anything else on the box misbehaving, we saw this:

2007-04-26 14:53:05 CDT 15175 :LOG:  server process (PID 18543) was terminated by signal 11
2007-04-26 14:53:05 CDT 15175 :LOG:  terminating any other active server processes

So my questions are:

1. What aspect of postgres' memory usage would create an "out of memory" condition?
2. What is the difference between an ERROR and FATAL "out of memory" message?
3. What would cause postgres to die from a signal 11?

I've also got a core file if that's necessary for further forensics.

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

615-260-0005
