"Anand Kumar, Karthik" <Karthik.AnandKumar@xxxxxxxxxxxxxx> writes: > Thanks Shaun! > > Yes, we're getting synchronous_commit on right now. > > The log_min_duration was briefly set to 0 at the time I sent out the post, > just to see what statements were logged right before everything went to > hell. Didn't yield much since we very quickly realized we couldn't cope > with the volume of logs. > > We also noticed that when trying to recover from a snapshot and replay > archived wal logs, it would corrupt right away, in under an hour. When > recovering from snapshots *without* replaying wal logs, we go on for a day > or two without the problem, so it does seem like wal logs are probably not > being flushed to disk as expected. Make sure your snapshots are atomic as you probably assume they are and in fact must be if you expect a consistent cluster after startup and crash recovery. That is, if you are doing snaps at random times and not wrapping with pgstart/stop backup() *and* replaying WAL till concisconsistent recovery point. If you're snapping something like a remote-site mirror running SAN block-level replication, unless the snap is done at the end of flushing all changed blocks since last tick, then the image you're snapping may not be consistent. I say that because, I came into a company that had been doing snaps this way since eons ago and thought that since the clusters would start up and could perform trivial checks, things were OK. As soon aas you subjected an instance dirived this way however with something wide-ranging such as an all-table vac/analyze, dumpall... etc, soon after launching the foo, corruption was observed. FWIW > > Will update once we get onto the new h/w to see if that fixes it. 
>
> Thanks,
> Karthik

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@xxxxxxxxxxx
p: 312.241.7800
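
For anyone wanting to see the backup-mode wrapping around a snapshot
spelled out, here is a rough sketch in Python, not a tested procedure.
Assumptions: `conn` is a DB-API connection (psycopg2, say) in autocommit
mode, `take_snapshot` is a hypothetical callable standing in for whatever
triggers your SAN or volume snapshot, and the server still has
pg_start_backup()/pg_stop_backup() under those names (PostgreSQL 15
renamed them to pg_backup_start()/pg_backup_stop()). On restore you still
need the archived WAL generated between start and stop to reach a
consistent recovery point.

```python
def snapshot_in_backup_mode(conn, take_snapshot, label="storage-snap"):
    """Run take_snapshot() between pg_start_backup() and pg_stop_backup().

    conn          -- DB-API connection; assumed autocommit, with the
                     "%s" parameter style that psycopg2 uses
    take_snapshot -- hypothetical callable that triggers the storage snap
    label         -- free-form backup label passed to pg_start_backup()
    """
    cur = conn.cursor()
    # Enter backup mode before the storage layer copies any blocks.
    cur.execute("SELECT pg_start_backup(%s)", (label,))
    try:
        return take_snapshot()
    finally:
        # Always leave backup mode, even if the snapshot itself failed.
        cur.execute("SELECT pg_stop_backup()")
        cur.close()
```

The try/finally is there so that a failed snapshot does not leave the
cluster stuck in backup mode. Whatever take_snapshot() returns (a
snapshot id, for instance) is passed back to the caller.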