Jeff Amiel <becauseimjeff@xxxxxxxxx> writes:
> Even more odd is that a LOCAL pg_dump (from on the
> box) succeeded just fine tonight (after the second
> crash).

That seems to eliminate the theory of a crash due to data corruption ...
unless the corruption somehow repaired itself in the intervening 30
minutes, which hardly seems likely.

> ----First Crash-------
> backup-srv2 prod_backup # time /usr/bin/pg_dump
> --format=c --compress=9 --ignore-version
> --username=backup --host=prod_server prod > x
> pg_dump: server version: 8.2.4; pg_dump version:
> 8.0.13
> pg_dump: proceeding despite version mismatch
> pg_dump: WARNING: terminating connection because of
> crash of another server process
> DETAIL: The postmaster has commanded this server
> process to roll back the current transaction and exit,
> because another server process exited abnormally and
> possibly corrupted shared memory.

Notice that pg_dump is showing that the crash was in some OTHER server
process, not the one it was attached to.

> ------Second Crash--------
> backup-srv2 ~ # time /usr/bin/pg_dump --format=c
> --compress=9 --username=backup --host=prod_server
> prod | wc -l
> pg_dump: Dumping the contents of table "audit" failed:
> PQgetCopyData() failed.
> pg_dump: Error message from server: server closed the
> connection unexpectedly
> This probably means the server terminated
> abnormally
> before or while processing the request.
> pg_dump: The command was: COPY public.audit (audit_id,

This one looks more like it might have been the directly connected
server process that crashed. However, your postmaster log from the
other message:

> From the logs tonight when the second crash occurred..
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [6-1] 2007-08-22 20:45:12 CDT LOG:
> received smart shutdown request
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [7-1] 2007-08-22 20:45:12 CDT LOG:
> server process (PID 20188) was terminated by signal 11
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [8-1] 2007-08-22 20:45:12 CDT LOG:
> terminating any other active server processes

raises still more questions: where the heck did the "smart shutdown
request" (that is to say, a SIGTERM interrupt to the postmaster) come
from? It's far too much of a coincidence for that to have occurred
within a second of detecting the server process crash.

> We have introduced some new network architecture which
> is acting odd lately (dell managed switches, netscreen
> ssgs, etc) and the database itself resides on a zfs
> partition on a Pillar SAN (connected via fibre
> channel)

I can't help thinking you are looking at generalized system
instability. Maybe someone knocked a few cables loose while installing
new network hardware?

			regards, tom lane