Re: problems using pg_start_backup/pg_stop_backup and pg_basebackup at same time

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Should I file a bug?

Does anyone know of a workaround?

I could try running pg_basebackup w/ --xlog-method=fetch and see if that helps but I do like knowing that the base backup will have all needed WAL files regardless of WAL retention policy on master.

Thanks.
Alex

Recap of what I *think* is going on:

1) pg_basebackup w/ --xlog-method=stream is started
2) archive_command is not executed while pg_basebackup() is running
3) while pg_basebackup() is running pg_start_backup() is executed
4) Later pg_stop_backup() is executed.
5) pg_stop_backup() hangs related to the archive_command not running
6) hours later when pg_basebackup finishes, back logged archive commands get executed then pg_stop_backup() finishes


On Mon, Jun 20, 2016 at 10:15 AM, Alex Malek <magicagent@xxxxxxxxx> wrote:

Thanks for the response Magnus.
I've replied inline below.

On Mon, Jun 20, 2016 at 4:17 AM, Magnus Hagander <magnus@xxxxxxxxxxxx> wrote:
On Mon, Jun 13, 2016 at 6:59 PM, Alex Malek <magicagent@xxxxxxxxx> wrote:

I am experiencing two problems
1) pg_stop_backup hangs until pg_basebackup finishes
2) WAL contains references to invalid pages

(I suspect when 1. is fixed I won't experience 2. )

Full description below:


postgresql: 9.3.13

My situation is I create a writable report database every day using pg_start_backup() / rsync / pg_stop_backup()
(whole process takes 1-2 hours, the rsync actually copies the data dir of a  slave/warm spare DB)

Taking what is called an exclusive backup, which is what pg_start_backup() does, is not supported off a standby node. Only pg_basebackup is supported against a standby node.

Can you reproduce your errors if you make this backup from the master?


pg_start_backup() is run on the master.
The rsync copies the data dir of the standby b/c it is local.
I have tried rsyncing from the master w/ the same results.
 

 
Also once a week I create a database for backup/archive purposes using pg_basebackup
(whole process takes about 13 hours)


These two processes used to be able to coincide until recently.
Recent changes include a major debian upgrade and a minor version of postgres upgrade from 9.3.10.

Now when the report sync occurs during the pg_basebackup (w/ --xlog-method=stream option) the pg_stop_backup() hangs until
the the pg_basebackup has completed (4 hours later).

Do you also have an archive_command, and what is it? Normally, pg_stop_backup() blocks on the archive command - so perhaps you have ended up with a dependency between that one and the base backup command somehow?


The archive command is:

archive_command = 'rsync -a %p postgres@archivedb:/storage/postgres/9.3/main-archive/pg_xlog/%f'

I thought the archive command was working properly while pg_stop_backup was hanging based on the timestamps of the files, but upon closer
inspection using stat I see that the files actually were not copied over until pg_stop_backup stopped hanging.

Actually it appears that the archive_command is not executed the entire time pg_basebackup is running.
pg_basebackup is started about 5 hours before pg_start_backup is executed.
Based on stat it looks like the wal files get written locally but do not get rsynced until pg_basebackup is done.

 
 
A labeled WAL backup file is created when the pg_stop_backup() is first executed and another
is created when the pg_basebackup completes.



While the pg_stop_backup() hangs the following appears in the logs:

2016-06-11 07:50:45 EDT: WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (7680 seconds elapsed)
2016-06-11 07:50:45 EDT: HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.


until eventually:

2016-06-11 12:59:07 EDT:LOG:  duration: 26190082.497 ms  statement: SELECT pg_stop_backup()

The archive command appears to work as WAL files are being archived successfully with timestamps that occur while pg_stop_backup is hanging.

-rw------- 1 postgres postgres 314 Jun 11 12:56 000000030000109200000066.00000028.backup
-rw------- 1 postgres postgres 16M Jun 11 12:56 000000030000109300000029
-rw------- 1 postgres postgres 16M Jun 11 12:54 000000030000109300000028
-rw------- 1 postgres postgres 16M Jun 11 12:52 000000030000109300000027
<snip>
-rw------- 1 postgres postgres 16M Jun 11 06:22 0000000300001092000000F1
-rw------- 1 postgres postgres 335 Jun 11 05:42 0000000300001092000000D4.00000028.backup
-rw------- 1 postgres postgres 16M Jun 11 05:42 0000000300001092000000F0


Content of 0000000300001092000000D4.00000028.backup:

START WAL LOCATION: 1092/D4000028 (file 0000000300001092000000D4)
STOP WAL LOCATION: 1092/F008EA10 (file 0000000300001092000000F0)
CHECKPOINT LOCATION: 1092/D4004E08
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-06-11 04:51:04 EDT
LABEL: reporting 2016-06-11 04:51:03.892804-04
STOP TIME: 2016-06-11 05:42:37 EDT

This file shows that the backup took about an hour, doesn't it? So it doesn't look like it was waiting?

Or are you saying the wait happens *after* the .backup file has been dropped in the archives?

The .backup file gets written locally in a timely manner but doesn't get dropped into the archive until after the
pg_stop_backup() stops hanging.

So it appears that pg_basebackup ( perhaps --xlog-method=stream ? ) is causing the archive_command to pause which in turn
causes pg_stop_backup() to hang?

Thanks.

Alex



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux