Re: pg_dump causes postgres crash

Tom Lane <tgl@xxxxxxxxxxxxx> · Wed, 22 Aug 2007 23:55:34 -0400

Jeff Amiel <becauseimjeff@xxxxxxxxx> writes:
> Even more odd is that a LOCAL pg_dump (from on the
> box) succeeded just fine tonight (after the second
> crash).

That seems to eliminate the theory of a crash due to data corruption
... unless the corruption somehow repaired itself in the intervening
30 minutes, which hardly seems likely.

> ----First Crash-------

> backup-srv2 prod_backup # time /usr/bin/pg_dump
> --format=c --compress=9 --ignore-version
> --username=backup --host=prod_server prod > x
> pg_dump: server version: 8.2.4; pg_dump version:
> 8.0.13
> pg_dump: proceeding despite version mismatch
> pg_dump: WARNING:  terminating connection because of
> crash of another server process
> DETAIL:  The postmaster has commanded this server
> process to roll back the current transaction and exit,
> because another server process exited abnormally and
> possibly corrupted shared memory.

Notice that pg_dump is showing that the crash was in some OTHER server
process, not the one it was attached to.
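
For anyone scripting their backups, here is a minimal sketch of how that distinction could be read off pg_dump's stderr. The function name and the returned labels are hypothetical, chosen only for illustration. The matched strings are the two failure messages quoted in this thread (the second appears further down). The "connection_lost" case is only a hint that the directly connected backend may have died, not proof.

```python
def classify_pg_dump_failure(stderr_text):
    """Rough classification of a pg_dump failure from its stderr text.

    Labels are illustrative only:
      "other_backend_crashed" - the server reported that some OTHER
                                server process crashed
      "connection_lost"       - the connection dropped with no notice,
                                which MAY mean our own backend crashed
      "unknown"               - neither message was found
    """
    # Collapse line wrapping so a message split across lines still matches.
    text = " ".join(stderr_text.split())
    if "crash of another server process" in text:
        return "other_backend_crashed"
    if "server closed the connection unexpectedly" in text:
        return "connection_lost"
    return "unknown"


first = """pg_dump: WARNING:  terminating connection because of
crash of another server process"""
second = """pg_dump: Error message from server: server closed the
connection unexpectedly"""

print(classify_pg_dump_failure(first))
print(classify_pg_dump_failure(second))
```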

> ------Second Crash--------

> backup-srv2 ~ # time /usr/bin/pg_dump --format=c
> --compress=9  --username=backup --host=prod_server
> prod | wc -l
> pg_dump: Dumping the contents of table "audit" failed:
> PQgetCopyData() failed.
> pg_dump: Error message from server: server closed the
> connection unexpectedly
>        This probably means the server terminated
> abnormally
>        before or while processing the request.
> pg_dump: The command was: COPY public.audit (audit_id,

This one looks more like it might have been the directly connected
server process that crashed.  However, your postmaster log from
the other message:

> From the logs tonight when the second crash occurred..
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [6-1] 2007-08-22 20:45:12 CDT   LOG: 
> received smart shutdown request
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [7-1] 2007-08-22 20:45:12 CDT   LOG: 
> server process (PID 20188) was terminated by signal 11
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [8-1] 2007-08-22 20:45:12 CDT   LOG: 
> terminating any other active server processes

raises still more questions: where the heck did the "smart shutdown
request" (that is to say, a SIGTERM interrupt to the postmaster) come
from?  It's far too much of a coincidence for that to have occurred
within a second of detecting the server process crash.
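
To check whether this coincidence shows up at other times too, something like the following sketch could scan the postmaster log for shutdown requests landing within a second of a backend crash. The sample lines are the ones quoted above, unwrapped onto single lines. The function names, the regular expressions, and the one-second window are my own assumptions, and the timestamp pattern assumes the log line format shown in this message.

```python
import re
from datetime import datetime

# Lines from the quoted postmaster log, unwrapped onto single lines.
LOG = """\
Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [6-1] 2007-08-22 20:45:12 CDT   LOG:  received smart shutdown request
Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [7-1] 2007-08-22 20:45:12 CDT   LOG:  server process (PID 20188) was terminated by signal 11
Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848 local0.info] [8-1] 2007-08-22 20:45:12 CDT   LOG:  terminating any other active server processes
"""

# Timestamp, time zone abbreviation, then the LOG message text.
LINE_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+\s+LOG:\s+(.*)$")
CRASH_RE = re.compile(r"server process \(PID (\d+)\) was terminated by signal (\d+)")


def parse_events(text):
    """Return (timestamp, kind, pid, signal) tuples for crashes and shutdown requests."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        msg = m.group(2)
        crash = CRASH_RE.search(msg)
        if crash:
            events.append((ts, "crash", int(crash.group(1)), int(crash.group(2))))
        elif "shutdown request" in msg:
            events.append((ts, "shutdown", None, None))
    return events


def shutdowns_near_crashes(events, window_seconds=1):
    """Pair each crash with any shutdown request logged within the window."""
    crashes = [e for e in events if e[1] == "crash"]
    shutdowns = [e for e in events if e[1] == "shutdown"]
    pairs = []
    for c in crashes:
        for s in shutdowns:
            if abs((c[0] - s[0]).total_seconds()) <= window_seconds:
                pairs.append((s[0], c[2], c[3]))  # (shutdown time, crashed PID, signal)
    return pairs


events = parse_events(LOG)
for when, pid, signo in shutdowns_near_crashes(events):
    print(f"shutdown request at {when} near crash of PID {pid} (signal {signo})")
```

Syslog timestamps only have one-second resolution here, so this can flag a suspicious pairing but cannot say which event actually came first.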

> We have introduced some new network architecture which
> is acting odd lately (dell managed switches, netscreen
> ssgs, etc) and the database itself resides on a zfs
> partition on a Pillar SAN (connected via fibre
> channel)

I can't help thinking you are looking at generalized system
instability.  Maybe someone knocked a few cables loose while
installing new network hardware?

			regards, tom lane
