Re: problems using pg_start_backup/pg_stop_backup and pg_basebackup at same time

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On Mon, Jun 27, 2016 at 8:27 PM, Alex Malek <magicagent@xxxxxxxxx> wrote:
Should I file a bug?

Oh, sorry. I did not spot that this was posted to pgsql-admin, which has a lot fewer readers. I've moved it over to -hackers where more people will see it.

(And also, please don't top-post on these lists, as it makes the discussion much harder to follow)

 

Does anyone know of a workaround?

I could try running pg_basebackup w/ --xlog-method=fetch and see if that helps but I do like knowing that the base backup will have all needed WAL files regardless of WAL retention policy on master.

Thanks.
Alex

Recap of what I *think* is going on:

1) pg_basebackup w/ --xlog-method=stream is started
2) archive_command is not executed while pg_basebackup() is running

There is no reason why archive_command should not run during a pg_basebackup run. Do you get anything at all in the logfile indicating that this would be the case? If you look at the ps output, does the archiver process indicate what it's doing, and does it have a sub-process for your rsync command?

Is there any chance that for example the network or other part of the system gets saturated and the archive_command is simply runniung too slow to keep up?


3) while pg_basebackup() is running pg_start_backup() is executed
4) Later pg_stop_backup() is executed.
5) pg_stop_backup() hangs related to the archive_command not running
6) hours later when pg_basebackup finishes, back logged archive commands get executed then pg_stop_backup() finishes

This would be consistent with (2), but there is no reason why (2) would happen because of pg_basebackup unless there's an actual bug there.

//Magnus

 
On Mon, Jun 20, 2016 at 10:15 AM, Alex Malek <magicagent@xxxxxxxxx> wrote:

Thanks for the response Magnus.
I've replied inline below.

On Mon, Jun 20, 2016 at 4:17 AM, Magnus Hagander <magnus@xxxxxxxxxxxx> wrote:
On Mon, Jun 13, 2016 at 6:59 PM, Alex Malek <magicagent@xxxxxxxxx> wrote:

I am experiencing two problems
1) pg_stop_backup hangs until pg_basebackup finishes
2) WAL contains references to invalid pages

(I suspect when 1. is fixed I won't experience 2. )

Full description below:


postgresql: 9.3.13

My situation is I create a writable report database every day using pg_start_backup() / rsync / pg_stop_backup()
(whole process takes 1-2 hours, the rsync actually copies the data dir of a  slave/warm spare DB)

Taking what is called an exclusive backup, which is what pg_start_backup() does, is not supported off a standby node. Only pg_basebackup is supported against a standby node.

Can you reproduce your errors if you make this backup from the master?


pg_start_backup() is run on the master.
The rsync copies the data dir of the standby b/c it is local.
I have tried rsyncing from the master w/ the same results.
 

 
Also once a week I create a database for backup/archive purposes using pg_basebackup
(whole process takes about 13 hours)


These two processes used to be able to coincide until recently.
Recent changes include a major debian upgrade and a minor version of postgres upgrade from 9.3.10.

Now when the report sync occurs during the pg_basebackup (w/ --xlog-method=stream option) the pg_stop_backup() hangs until
the the pg_basebackup has completed (4 hours later).

Do you also have an archive_command, and what is it? Normally, pg_stop_backup() blocks on the archive command - so perhaps you have ended up with a dependency between that one and the base backup command somehow?


The archive command is:

archive_command = 'rsync -a %p postgres@archivedb:/storage/postgres/9.3/main-archive/pg_xlog/%f'

I thought the archive command was working properly while pg_stop_backup was hanging based on the timestamps of the files, but upon closer
inspection using stat I see that the files actually were not copied over until pg_stop_backup stopped hanging.

Actually it appears that the archive_command is not executed the entire time pg_basebackup is running.
pg_basebackup is started about 5 hours before pg_start_backup is executed.
Based on stat it looks like the wal files get written locally but do not get rsynced until pg_basebackup is done.

 
 
A labeled WAL backup file is created when the pg_stop_backup() is first executed and another
is created when the pg_basebackup completes.



While the pg_stop_backup() hangs the following appears in the logs:

2016-06-11 07:50:45 EDT: WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (7680 seconds elapsed)
2016-06-11 07:50:45 EDT: HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.


until eventually:

2016-06-11 12:59:07 EDT:LOG:  duration: 26190082.497 ms  statement: SELECT pg_stop_backup()

The archive command appears to work as WAL files are being archived successfully with timestamps that occur while pg_stop_backup is hanging.

-rw------- 1 postgres postgres 314 Jun 11 12:56 000000030000109200000066.00000028.backup
-rw------- 1 postgres postgres 16M Jun 11 12:56 000000030000109300000029
-rw------- 1 postgres postgres 16M Jun 11 12:54 000000030000109300000028
-rw------- 1 postgres postgres 16M Jun 11 12:52 000000030000109300000027
<snip>
-rw------- 1 postgres postgres 16M Jun 11 06:22 0000000300001092000000F1
-rw------- 1 postgres postgres 335 Jun 11 05:42 0000000300001092000000D4.00000028.backup
-rw------- 1 postgres postgres 16M Jun 11 05:42 0000000300001092000000F0


Content of 0000000300001092000000D4.00000028.backup:

START WAL LOCATION: 1092/D4000028 (file 0000000300001092000000D4)
STOP WAL LOCATION: 1092/F008EA10 (file 0000000300001092000000F0)
CHECKPOINT LOCATION: 1092/D4004E08
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-06-11 04:51:04 EDT
LABEL: reporting 2016-06-11 04:51:03.892804-04
STOP TIME: 2016-06-11 05:42:37 EDT

This file shows that the backup took about an hour, doesn't it? So it doesn't look like it was waiting?

Or are you saying the wait happens *after* the .backup file has been dropped in the archives?

The .backup file gets written locally in a timely manner but doesn't get dropped into the archive until after the
pg_stop_backup() stops hanging.

So it appears that pg_basebackup ( perhaps --xlog-method=stream ? ) is causing the archive_command to pause which in turn
causes pg_stop_backup() to hang?

Thanks.

Alex




--

[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux