>> Oct 5 15:00:25 db01 postgres[76648]: [5944-1] javafail@dbcluster 76648 0: FATAL: pipe() failed: Too many open files in system > > This message must be coming from initSelfPipe(), and after poking around > a bit I think the failure must be occurring while a new backend is > attempting to do "OwnLatch(&MyProc->procLatch)" in InitProcess. The > reason the postmaster treats this as a crash is that the new backend > just armed the dead-man switch (MarkPostmasterChildActive) but it exits > without doing ProcKill which would disarm it. So this is just an > order-of-operations bug in InitProcess: we're assuming that it can't > fail before reaching "on_shmem_exit(ProcKill, 0)", and the latch > additions broke that. (Though looking at it, assuming that the > PGSemaphoreReset call cannot fail seems a tad risky too.) > > So that explains the crashes, but it doesn't (directly) explain why you > had data corruption. I've since been able to repeat this with full_page_writes=on and have had identical corruption, so I don't think this is full_page_writes related any more. I can also confirm that there was just one crash by the master database because it required manual intervention to bring back up (a backup label was sitting stale in the data dir[1]). The slaves died during the post-crash VACUUM just as earlier. I'm completing the VACUUM now and am seeing multiple warnings from VACUUM. WARNING: relation "tbl_a" page 2115352 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115353 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115354 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115355 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115356 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115357 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115358 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115359 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115360 is uninitialized --- fixing On one of the slaves: Oct 11 00:17:36 db02 postgres[66904]: [21-1] @ 66904 0: WARNING: page 120547 of relation base/16387/20196 is uninitialized Oct 11 00:17:36 db02 postgres[66904]: [21-2] @ 66904 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 125016, lastBlockVacuumed 0 Oct 11 00:17:36 db02 postgres[66904]: [22-1] @ 66904 0: PANIC: WAL contains references to invalid pages Oct 11 00:17:36 db02 postgres[66904]: [22-2] @ 66904 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 125016, lastBlockVacuumed 0 Oct 11 00:17:36 db02 postgres[66897]: [10-1] @ 66897 0: LOG: startup process (PID 66904) was terminated by signal 6: Abort trap Oct 11 00:17:36 db02 postgres[66897]: [11-1] @ 66897 0: LOG: terminating any other active server processes Hopefully this is helpful information. -sc [1] The backup label was residual from the initial sync and wasn't cleaned up for some reason during a pg_stop_backup(). For now I'm chalking this up as a bug in repmgr even though repmgr completed cloning the slave successfully (supposedly). START WAL LOCATION: 9F/36000020 (file 000000010000009F00000036) CHECKPOINT LOCATION: 9F/37E9D6D8 BACKUP METHOD: pg_start_backup BACKUP FROM: master START TIME: 2012-10-06 17:48:10 UTC LABEL: repmgr_standby_clone_1349545601 -- Sean Chittenden sean@xxxxxxxxxxxxxx -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general