Re: Using pg_start_backup() and pg_stop_backup()

David B Harris <dbharris@xxxxxxxxxxxxx> · Tue, 16 Jul 2013 20:35:49 -0400

On Wed Jul 17, 09:16am +0900, Michael Paquier wrote:
> On Wed, Jul 17, 2013 at 8:24 AM, David B Harris <dbharris@xxxxxxxxxxxxx> wrote:
> > On Wed Jul 17, 08:12am +0900, Michael Paquier wrote:
> >      4. In parallel, WAL archiving has copied all the logs from the
> >         start of pg_start_backup() through to the end of pg_stop_backup()
> Yes exactly. This avoids to have to maintain your own scripts and rely
> on the server features... What do you actually do to copy the
> necessary WAL files. Do you fetch them directly from the master's
> pg_xlog folder?

Due to bandwidth constraints separate scripts are going to need to be
maintained anyways. I'd love to use pg_basebackup, but since it
transfers the entire cluster each time, it's not suitable - something
like rsync is going to be used instead. Actually the use of
pg_basebackup would require a separate script anyway, so regardless. If
the WAL files do need to be copied, I'll be copying them separately
yeah.

(I'll use archive_command that only runs when backup_label is in place
[pg_startup() has been called], put them somewhere safe during the
cluster rsync [probably outside of the cluster directory], transfer them
after pg_stop_backup(), and then delete them on the source/master
server.)

> > A configuration like this is provided like this in the documentation
> > (again section 24.3), though some assembly is required. I'm hoping to
> > avoid it since it'll be introducing a second channel to the backup which
> > I can almost guarantee will be forgotten in time. (BTW, we tend to
> > consider the lifespan of our installations in terms of decades.)
> >
> > I mean, if it's necessary, the so be it of course, but the question
> > still remains: I've tested a fair bit under fairly adverse conditions
> > and not had a single failure. Luck?
> 
> It looks so, and wal_keep_segments is set to a value high enough on
> the master side so as all the necessary WAL files are kept intact
> somewhere even if they are not archived.

I'm going to re-run my tests again, but here's what I did:

    1. Set wal_keep_segments to a small value (I think I used 8)
    2. Create a new database with a small canary table
    2. pg_start_backup()
    3. Run a bunch of database drops/loads and table drops/loads (with
       pg_restore and COPY .. FROM, respectively) ... for hours and
       hours (writing tens of GBs of data, without question cycling
       through wal_keep_segments)
    4. Take a tarball of the cluster (including pg_xlog/)
    5. pg_stop_backup()
    6. Restore from tarball, start cluster
    7. Check for canary table (successfully)

Separately, I also prototyped the backup and I've run it dozens of times
occasionally with a heavy write load on the source/master server without
errors. Now the write load was likely never enough to saturate all
allocatable WAL segments, but given the documentation (which says that
the _last_ WAL segment which is synced by pg_stop_backup() must be
copied as well), I'd have expected a consistency error or similar.

Since somebody has now said "no you're crazy" (paraphrased :), I'm going
to double-check again.

I didn't mention in the original email, but of course one possibility is
that the documentation might be incomplete simply in that if
wal_keep_segments is set high enough such that the earliest segment
isn't overwritten while the rsync is running, everything would be fine.
But (again if I'm reading the documentation correctly), I think there
should be consistency errors (again unless I'm just lucky). Maybe if
there's an incomplete WAL transaction/segment/whatever, it just silently
ignores it. (This would of course make sense.)

I'll run the tests again and this time not use a canary table. It occurs
to me that what I might be seeing is data files in the cluster not being
deleted. If I'm dropping databases and/or tables, my base backup may be
getting the old ones but the new ones might be inconsistent/broken.
(Though I'd still expect a "cannot replay log" error of some sort at
cluster startup.)

-- 
     Arguing with an engineer is like wrestling with a pig in mud.
           After a while, you realise the pig is enjoying it.

                   OpenPGP v4 key ID: 4096R/59DDCB9F
    Fingerprint: CC53 F124 35C0 7BC2 58FE  7A3C 157D DFD9 59DD CB9F
                     Retrieve from subkeys.pgp.net

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general