On Jul 31, 2013, at 7:13, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

> Jeff Janes <jeff.janes@xxxxxxxxx> writes:
>> On Tuesday, July 30, 2013, James Sewell wrote:
>>> I understand what you are saying, and I understand how the backup_label
>>> works - but I still don't understand why the pg_start and pg_stop commands
>>> are REQUIRED when doing a snapshot backup to ensure data integrity.
>>>
>>> Surely not using them and restoring a snapshot is the same as starting
>>> after a crash, and will result in log replay to get to the latest possible
>>> consistent state?
>
>> That's true provided that all of your data is on a single volume, and you
>> trust your "snapshot" to be free of bugs.
>
> James stated to begin with that his data was spread across multiple
> volumes, so it's definitely not safe for him to omit
> pg_start_backup/pg_stop_backup.

(…)

> For recovery from a tar backup to work, the archived copy of pg_control
> must point to a spot in the WAL sequence that is before any changes that
> could possibly not yet appear in any archived data files. The purpose of
> pg_start_backup/pg_stop_backup is to provide the synchronization needed to
> meet this requirement.
>
> The comparable case isn't possible for crash recovery, assuming that the
> OS and storage hardware implement fsync() correctly, because we'll have
> fsync'd the changes to foo down to disk before updating pg_control.
>
> Now, if you instead take a filesystem snapshot (representing some
> instantaneous state of the disk contents) and run "tar" to copy that,
> you have a good backup, because you must have a copy of pg_control that
> will tell you to re-apply any changes that are missing from the data
> files, as well as WAL files that contain the needed records. However,
> this is only certain if all that data is on *one* filesystem, because
> otherwise you can't be sure you have mutually consistent snapshots.
> And you're vulnerable to any bugs in the filesystem's snapshot
> implementation that might give you inconsistent copies of different
> files. (Such bugs would probably be closely related to bugs in fsync
> ... but that doesn't mean they're necessarily exactly the same.)

That raises the question of what happens in case of a crash, or (worse) a partial crash, when multiple file systems are involved. Say one tablespace is on a ZFS volume and another is on a UFS volume, and the ZFS code crashes. That should result in a kernel panic, of course, in which case every file system is stopped at the same moment and there should be a consistent "snapshot". But what if it doesn't trigger a panic?

And how would file-system recovery play into this? Would a journaled file system roll back every segment involved in the same interrupted transaction, or could this create an inconsistent snapshot where the original situation (sans file-system recovery) would actually have been a preferable state to start from?

And what happens if one of these volumes were, for example, ejected from a RAID controller (because all of its disks have triggered alarms, whether genuine or not) while the other volume is not ejected? Does the database abort correctly, or does that result in an inconsistent state? I suppose this case usually doesn't matter much - part of the database is gone completely anyway - but what if the RAID controller was wrong, and the disks are actually just fine and come back once re-inserted?
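(As an aside, for anyone following along: the bracketing Tom describes boils down to something like the sketch below. It is only an illustration - the psycopg2 connection string, the backup label and the zfs snapshot commands are placeholders, not anything from this thread - but it shows pg_start_backup()/pg_stop_backup() wrapped around snapshots of each volume that holds data files or tablespaces.)

    # Minimal sketch of the synchronization Tom describes: bracket the volume
    # snapshots with pg_start_backup()/pg_stop_backup() so the copied
    # pg_control points to a WAL position from before any change that might
    # not yet be in the copied data files.
    # The DSN, label and snapshot commands are illustrative assumptions.
    import subprocess

    import psycopg2

    conn = psycopg2.connect("dbname=postgres")   # hypothetical DSN
    conn.autocommit = True
    cur = conn.cursor()

    # Tell the server a base backup is starting (forces a checkpoint and
    # writes a backup_label file in the data directory).
    cur.execute("SELECT pg_start_backup(%s)", ("nightly-snapshot",))

    try:
        # Snapshot every volume involved; these commands are placeholders
        # for whatever your filesystem or storage layer provides.
        subprocess.check_call(["zfs", "snapshot", "tank/pgdata@nightly"])
        subprocess.check_call(["zfs", "snapshot", "tank/pgtblspc@nightly"])
    finally:
        # Always end backup mode, even if a snapshot command fails.
        cur.execute("SELECT pg_stop_backup()")

    cur.close()
    conn.close()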
That RAID-eject scenario has actually happened several times here, caused by SATA wires vibrating loose over time (it took me a while to discover the cause); my databases have always been on a single volume though, so I've never had the opportunity to run into this.

I realise these are typical "what if" scenarios, so I suppose answering them doesn't have a high priority. It's just that I've been wondering/worrying about the seeming increase in people reporting database corruption - I was just wondering whether issues like these might play a part in that (I'm sure a large part of it is simply more people using PG these days).

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.