Re: Consistent file-level backup of pg data directory

Jeff Janes <jeff.janes@xxxxxxxxx> · Wed, 8 Jan 2014 10:55:25 -0800

On Wed, Jan 8, 2014 at 7:09 AM,  <gator_ml@xxxxxxxx> wrote:

On 2014-01-07 21:54, Jeff Janes wrote:

> On Tue, Jan 7, 2014 at 10:58 AM, <gator_ml@xxxxxxxx

> <mailto:gator_ml@xxxxxxxx>> wrote:

>     - keep a copy "backup_label" under a different name

[...[

> Why under a different name?  That sounds dangerous to me.

... otherwise it would be useless, because "pg_stop_backup"

will delete the "backup_label" file

That is one of the reasons you should run the backup before you tell PostgreSQL that you finished the backup :)

>     - call pg_stop_backup()

>     - the run the regular backup procedure

>

>

> You have this order backwards.  You can only tell postgres that you are

> done doing the backup once you are done doing the backup.

this is what I tried 1st - unfortunately, this won't work at all,

because postgres creates another checkpoint _after_ pg_stop and will

refuse to run the restore procedure if this last wal file is missing

(which in my scenario inevitably will be the case)

Eventually another checkpoint will be run, but pg_stop_backup will not directly cause one to happen.  What it creates is a record in the WAL stream saying that the backup has finished.  That record is what it needs to see.  After writing that record, it triggers a log switch (not the same thing as a checkpoint) so that the xlog file containing the record will get archived (if you are were truly using archiving).

That requirement is there to protect your data, you cannot safely circumvent the need for it.  If you have backed up the xlog file that contains that record before the record was present, then that will be a problem, as it won't find the record. The solution for that (other than using archiving) is to backup everything before calling pg_stop_backup, then backup just the pg_xlog after calling pg_stop_backup.

> It is pretty much a disaster.  Do you really need rsync?  Otherwise, use

> pg_basebackup, it is easier to get right.

... it looks like I didn't explain clear enough, what I am trying to

accomplish. So let me try it again:

We have a general backup procedure, that runs every night. For

some machines, there may additionally be more

specialized/fine-grained measures, but in any case, it should be

possible to restore any given server without any machine-specific

knowledge from the last nightly backup following a

generic "restore-recipe". Our current rsync-based solution is a

pretty good compromise between many different requirements.

For machines running database systems, this means, this means,

that I need some way to get a consistent state of some point in

time. It does not particularly matter, which time exactly (in

particular, I do not care, if transactions committed during the

backup are included or not) For this purpose, I can define

machine-specific hook to run before/ after the actual backup

procedure.

I think it would be easier to just exclude the database from the system-wide backup and use a different method for it, rather than engineer the necessary before/after hooks onto the system-wide backup.

Unfortunately, it does not look like there is any direct way to

accomplish this with postgres except shutting down the whole

database system while the backup is running. The systems will

be almost idle while the backup runs, and write transactions

will be very rare, but a total shutdown for every backup still

would be too radical.

Actually, with some tests copying the database directory while

writing data to the database and then restarting postgres with

such a copy, I did not manage to produce any visible

inconsistencies. If postgres was able to restore the last

consistent state in such a situation, then no additional measures

would be needed (like I said, it is not particularly important

which state this is - losing everything written during the last

15 minutes including committed transactions would be good enough)

.But at least as far as I can tell from the documentation, this

can not safely be assumed.

On small, nearly idle systems, you can often get away with doing a lot of dangerous things.  This is perhaps unfortunate, as it breeds bad habits.

Otherwise, the procedures for "Continuous Archiving" are clearly

made for some a diffent purpose, it seems like this is about as

close as it gets to what I need. As far as I understand:

- between pg_start_backup() and pg_stop_backup() it should be safe

to copy the data directory any time

- all committed transactions during this time will go to the wal files

(actually, I noticed that files in other directories than pg_xlog are

modified, too ...)

True, but the data in pg_xlog tells it how to replay those other changes to those other directories.  So it doesn't matter whether the other file was backed up pre-modification or post-modification, as the WAL allows it to be repaired either way. Unless the WAL stream ends too early--then it can't repair them.

- normally, it is intended to copy every single newly written wal file

somewhere else by whatever is defined as "archive_command", but

if this works, than any other way to copy all wal files written

after pg_start_backup should work just as well.

Kind of.  The archive_command conspires with database server to prevent it from recycling a wal file until we know it has been successfully copied by archive_command.  It is difficult for "any other way" to make that guarantee--although if the server is nearly idle, then the risk of such recycling causing problems is rather low.

My original idea was, to call pg_start_backup() 1st, then run the

actual backup followed by pg_stop_backup. Because the procedure

for " Point-in-Time Recovery" is the same and no wal file may be

modified after it was written and "archive_command" called, it

follows, that any wal file plus the preceding wal files written

after pg_start_backup must contain everything that is needed to

restore any state up to the time the wal file was

written (otherwise the whole "PITR)" mechanism would not work).

PITR allows you to recover to any point after the backup successfully ended.  It does not allow you to recover to some point in the middle of the backup (other than by starting at the previous successful backup and rolling forward).

I think that this is a logical necessity.  If you could use a backup set to restore to a point in the middle of that same backup, then you wouldn't be in the middle of the backup, you would be done with it.

Unfortunately, postgres refuses to run the restore procedure if

the last wal file (which is written after pg_stop_backup, so in

my case it will not be available) even if "recovery_target_time"

is explicitly set to some earlier time. I didn't try it, but

assuming that the last wal file is not really needed in this

case, it would be enough to just create an empty file with the

name postgres is looking for?

It is really needed if you have high standards of data integrity and safety.  That is why PostgreSQL requires it.

If I can't find, a better way, I probably could arrange to

separately copy this last wal file into the

backup (unfortunately, there is no easy way to do this ...)-:

You could just copy the entire pg_xlog directory again.  It is easier, probably safer (there is no guarantee that the "last" wal file is really only one file, rather than two or three), and on a low-traffic server the size is probably not all that large.

Cheers,

Jeff