Looks like you've got some form of corruption:

    page 1441792 of relation base/63229/63370 does not exist

The question is whether it was corrupted on the master and then
replicated to the slave, or whether it was corrupted on the slave
itself. I'd guess that the pg_dump tried to read from that page and
barfed. It would be interesting to re-run the pg_dump and see whether
the crash can be reproduced. If so, does it also reproduce when you run
pg_dump against the master? If not, then the corruption is isolated to
the slave, and you might have a hardware problem there that is
corrupting the data. (Rough sketches of both checks are at the end of
this message.)

On Fri, Mar 29, 2013 at 9:19 AM, Quentin Hartman
<qhartman@xxxxxxxxxxxxxxxxxxx> wrote:
> Yesterday morning, one of my streaming replication slaves running 9.2.3
> crashed with the following in the log file:
>
> 2013-03-28 12:49:30 GMT WARNING: page 1441792 of relation base/63229/63370 does not exist
> 2013-03-28 12:49:30 GMT CONTEXT: xlog redo delete: index 1663/63229/109956; iblk 303, heap 1663/63229/63370;
> 2013-03-28 12:49:30 GMT PANIC: WAL contains references to invalid pages
> 2013-03-28 12:49:30 GMT CONTEXT: xlog redo delete: index 1663/63229/109956; iblk 303, heap 1663/63229/63370;
> 2013-03-28 12:49:31 GMT LOG: startup process (PID 22941) was terminated by signal 6: Aborted
> 2013-03-28 12:49:31 GMT LOG: terminating any other active server processes
> 2013-03-28 12:49:31 GMT WARNING: terminating connection because of crash of another server process
> 2013-03-28 12:49:31 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2013-03-28 12:49:31 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2013-03-28 12:57:44 GMT LOG: database system was interrupted while in recovery at log time 2013-03-28 12:37:42 GMT
> 2013-03-28 12:57:44 GMT HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 2013-03-28 12:57:44 GMT LOG: entering standby mode
> 2013-03-28 12:57:44 GMT LOG: redo starts at 19/2367CE30
> 2013-03-28 12:57:44 GMT LOG: incomplete startup packet
> 2013-03-28 12:57:44 GMT LOG: consistent recovery state reached at 19/241835B0
> 2013-03-28 12:57:44 GMT LOG: database system is ready to accept read only connections
> 2013-03-28 12:57:44 GMT LOG: invalid record length at 19/2419EE38
> 2013-03-28 12:57:44 GMT LOG: streaming replication successfully connected to primary
>
> As you can see, I was able to restart it and it picked up and
> synchronized right away, but this crash still concerns me.
>
> The DB has about 75GB of data in it, and it is almost entirely write
> traffic. It's essentially a log aggregator. I believe it was doing a
> pg_dump backup at the time of the crash. It has hot_standby_feedback
> on to allow that process to complete.
>
> Any insights into this, or advice on tracking down the root cause,
> would be appreciated. So far everything I've found like this is either
> a bug that should already be fixed in this version, or the internet
> equivalent of a shrug.
>
> Thanks!
>
> QH

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@xxxxxxxxx
LlamaLand                          https://netllama.linux-sxs.org
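
For reference, here's a rough sketch of the catalog lookups that map
the file path in that error to an actual table name. The OIDs (63229,
63370) come straight from the log above; the host and database names
are placeholders you'd need to adjust for your setup:

    # First, find which database directory base/63229 belongs to
    # (this query can be run from any database on the standby):
    psql -h standby.example.com -d postgres \
         -c "SELECT datname FROM pg_database WHERE oid = 63229;"

    # Then, connected to the database the first query names, find
    # which relation owns file node 63370:
    psql -h standby.example.com -d yourdb \
         -c "SELECT relname, relkind FROM pg_class WHERE relfilenode = 63370;"

Knowing whether that's a table you can rebuild (or an index you can
simply REINDEX) changes how painful the recovery is.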
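
And a sketch of the re-run test suggested above. Again, hostnames and
the database name are placeholders; the output is thrown away because
the only question is whether the read completes:

    # Re-run the dump against the standby. Note that 9.2's pg_dump
    # takes the database name as a positional argument:
    pg_dump -h standby.example.com yourdb > /dev/null

    # If that fails on the same relation, try the master. A clean run
    # here would suggest the corruption is local to the standby:
    pg_dump -h master.example.com yourdb > /dev/null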