Re: File system level backup of shut down standby does not work?

"Antman, Jason (CMG-Atlanta)" <Jason.Antman@xxxxxxxxxx> · Wed, 19 Feb 2014 13:12:40 +0000

That is correct. When the source server is shut down, it may or may not 
be caught up for replication (most likely not). You'll need all of the 
WAL files that haven't been applied yet; if for some reason you've got a 
large amount of replication lag, you might need more than just the first.

I'm not sure the best way to achieve this - we write WALs to a shared 
NFS mount, one directory per slave, so that gets mounted on the 
destination/standby and it has access. I'm not sure if this is practical 
without specialized hardware - our NFS is served from a NetApp storage 
appliance, so we have microsecond write latency.

If you want to have a "standalone" backup that doesn't need WAL files, 
it's a bit more complex. This is the best that I can figure out from our 
current setup. Please be cautious with it, I'm not 100% positive this 
includes everything, so I'd only try it on an isolated server that can 
risk having problems. This is also a bit of a hack...

Assuming PGDATA is /var/lib/pgsql/9.0/data, as the postgres user:
1) `touch /var/lib/pgsql/9.0/data/stop.replication`
2) make sure your recovery.conf includes:
trigger_file='/var/lib/pgsql/9.0/data/stop.replication'
recovery_end_command='/var/lib/pgsql/9.0/data/stop.replication'
standby_mode='on'
3) Start the database

When recovery finishes, it will attempt to execute recovery_end_command. 
Since this isn't actually executable, it raises an error, and stops 
postgres. This should result in a stopped database instance, in a 
consistent recovery state with all WAL files applied. You can confirm 
this by checking that `pg_controldata /var/lib/pgsql/9.0/data` output 
includes "Database cluster state: *shut down".

Be aware that we use an intermediate host for this. i.e. we stop a 
slave, copy PGDATA *and* ALL WAL files over to another machine, start 
the slave back up, and then do the recovery to a consistent state there, 
to end up with a standalone PGDATA directory that can be copied wherever 
we need it without any WAL files.

-Jason

On 02/19/2014 05:18 AM, Jürgen Fuchsberger wrote:
> All,
>
> One very important thing I just noted when shutting down and restarting
> my standby server:
>
> My standby server *always needs the last WAL-file* from the archive
> directory, even when the shut down was "smart". Without this the
> consistent recovery state will not be reached.
>
> 2014-02-19 11:10:20 CET LOG:  received smart shutdown request
> 2014-02-19 11:10:20 CET LOG:  shutting down
> 2014-02-19 11:10:20 CET LOG:  database system is shut down
> 2014-02-19 11:11:00 CET LOG:  database system was shut down in recovery
> at 2014-
> 02-19 11:10:20 CET
> 2014-02-19 11:11:00 CET LOG:  entering standby mode
> 2014-02-19 11:11:00 CET LOG:  incomplete startup packet
> 2014-02-19 11:11:01 CET FATAL:  the database system is starting up
>
> *2014-02-19 11:11:01 CET LOG:  restored log file*
> *"00000001000002DE000000BF" from archive*
>
> 2014-02-19 11:11:01 CET LOG:  redo starts at 2DE/BF036FA4
> 2014-02-19 11:11:01 CET FATAL:  the database system is starting up
> 2014-02-19 11:11:01 CET LOG:  consistent recovery state reached at
> 2DE/BFFFE53C
> 2014-02-19 11:11:01 CET LOG:  database system is ready to accept read
> only connections
>
> So my question is, could there be something wrong with my configuration
> or is this normal?
>
> Juergen
>
> On 02/19/2014 02:14 AM, Antman, Jason (CMG-Atlanta) wrote:
>> Juergen,
>>
>> I've seen this quite a lot in the past, as we do this multiple times a day.
>>
>> Here's the procedure we use to prevent it:
>> 1) read the PID from postmaster.pid in the data directory
>> 2) Issue "service postgresql-9.0 stop" (this does a fast shutdown with
>> -t 600)
>> 3) loop until the PID is no longer running, or a timeout is exceeded (in
>> which case we error out)
>> 4) the IMPORTANT part: `pg_controldata /path/to/data | grep "Database
>> cluster state: *shut down"` - if pg_controldata output doesn't include
>> "shut down" or "shut down in recovery", then something's amiss and the
>> backup won't be clean (error in shutdown, etc.)
>> 5) `sync`
>> 6) now take the backup
>>
>> -Jason
>>
>> On 02/17/2014 08:32 AM, Jürgen Fuchsberger wrote:
>>> Hi all,
>>>
>>> I have a master-slave configuration running the master with WAL
>>> archiving enabled and the slave in recovery mode reading back the WAL
>>> files from the master ("Log-shipping standby" as described in
>>> http://www.postgresql.org/docs/9.1/static/warm-standby.html)
>>>
>>> I take frequent backups of the standby server:
>>>
>>> 1) Stop standby server (fast shutdown).
>>> 2) Rsync to another fileserver
>>> 3) Start standby server.
>>>
>>> I just tried to recover one of these backups which *failed* with the
>>> following errors:
>>>
>>> 2014-02-17 14:27:28 CET LOG:  incomplete startup packet
>>> 2014-02-17 14:27:28 CET LOG:  database system was shut down in recovery
>>> at 2013-12-25 18:00:03 CET
>>> 2014-02-17 14:27:28 CET LOG:  could not open file
>>> "pg_xlog/00000001000001E300000061" (log file 483, segment 97): No such
>>> file or directory
>>> 2014-02-17 14:27:28 CET LOG:  invalid primary checkpoint record
>>> 2014-02-17 14:27:28 CET LOG:  could not open file
>>> "pg_xlog/00000001000001E300000060" (log file 483, segment 96): No such
>>> file or directory
>>> 2014-02-17 14:27:28 CET LOG:  invalid secondary checkpoint record
>>> 2014-02-17 14:27:28 CET PANIC:  could not locate a valid checkpoint record
>>> 2014-02-17 14:27:29 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:29 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:30 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:30 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:31 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:31 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:32 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:33 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:33 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:33 CET LOG:  startup process (PID 26186) was terminated
>>> by signal 6: Aborted
>>> 2014-02-17 14:27:33 CET LOG:  aborting startup due to startup process
>>> failure
>>>
>>>
>>> So it seems the server is missing some WAL files which are not
>>> in the backup? Or is it simply not possible to take a backup of a
>>> standby server in recovery?
>>>
>>> Best,
>>> Juergen
>>>
>>>
>>>
>>

-- 

Jason Antman | Systems Engineer | CMGdigital
jason.antman@xxxxxxxxxx | p: 678-645-4155

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general