Re: Strange times in WAL files in archive directory (9.3)

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Tue, 24 Jan 2017 18:04:35 +0200

On 24/01/2017 17:55, Stephen Frost wrote:
Greetings,

* Achilleas Mantzios (achill@xxxxxxxxxxxxxxxxxxxxx) wrote:
On 24/01/2017 16:55, Stephen Frost wrote:
* Achilleas Mantzios (achill@xxxxxxxxxxxxxxxxxxxxx) wrote:
I provided the archive_command in the first post. The copy goes to another host (called sma in the command):
archive_command = '/usr/bin/scp %p sma:/smadb/pgsql/pitr/%f'
Note that this is not a recommended archive command: there is no
guarantee that the copied WAL files are sync'd to disk on the 'sma' host,
and you could potentially lose a significant amount of your WAL on a
failure.
I had changed that already to
/usr/bin/rsync -a --ignore-existing %p sma:/smadb/pgsql/pitr/%f
--ignore-existing is actually a *bad* idea, really.  There can be cases
where PG will end up calling archive_command on the same file, but in
those cases you should really be checking that the two WAL files are
IDENTICAL, otherwise you may have a misconfigured system and are pushing
the WAL for two different PG systems to the same directory.
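
To illustrate the check described above, here is a minimal sketch of an archive step that refuses to overwrite a WAL file whose content differs. It is not a recommended or complete archive_command. The function name archive_if_new_or_identical is hypothetical, and the sketch assumes the destination path is directly reachable as a file (for example, because the script runs on the archive host). It does not address syncing to disk.

```python
import filecmp
import os
import shutil


def archive_if_new_or_identical(src: str, dst: str) -> int:
    """Return 0 on success, 1 if dst already exists with different content.

    Hypothetical helper: PostgreSQL treats a non-zero exit status from
    archive_command as a failure and will retry the segment later.
    """
    if os.path.exists(dst):
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting file size and modification time.
        if filecmp.cmp(src, dst, shallow=False):
            return 0  # same segment archived twice: harmless
        return 1      # different content: refuse to overwrite
    shutil.copyfile(src, dst)
    return 0
```

A wrapper script could pass the return value to sys.exit(), so that a mismatch shows up as a failed archive attempt in the PostgreSQL log instead of being silently skipped.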

So you are saying that scp does not sync the destination file? So if the remote host crashes, scp may already have returned 0 even though the file never reached disk?
Yes, if the remote system crashes right after rsync (or scp) has
returned, the WAL file may not have been sync'd to reliable storage and
will be lost.
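
As a rough illustration of what "sync'd to reliable storage" involves, here is a minimal sketch of a copy that only returns once the data and its directory entry have been flushed. It assumes a POSIX system and that the code runs on the host that owns the archive storage; with a remote copy such as the one in this thread, the equivalent steps would have to happen on the remote side. The function name copy_durably and the ".tmp" suffix are hypothetical.

```python
import os
import shutil


def copy_durably(src: str, dst: str) -> None:
    """Copy src to dst and flush both the file and its directory entry."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    tmp = dst + ".tmp"  # hypothetical temporary name
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()             # push Python's buffer to the OS
        os.fsync(fout.fileno())  # ask the OS to write the file data to disk
    # Rename only after the data is flushed, so a crash never leaves a
    # partially written file under the final name.
    os.rename(tmp, dst)
    # fsync the directory so the new directory entry is flushed as well.
    dir_fd = os.open(dst_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Writing to a temporary name first means a reader of the archive directory sees either the complete segment or nothing under the final name.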

Thanks for the suggestions. We have been using a combination of WAL
archiving, base backups and streaming replication for years, so I
guess we'll be alright for the time being. The point is that we recently
moved to a cloud environment and we have to "port" our traditional
operations to the utilities/tools provided by the cloud provider.
I would not assume that, because it has been working, it won't ever
fail in an unfortunate way.

Also, always, always, always test your backups.  All of them, ideally,
otherwise they may not work when you need them most.

Anyway, is there any theory about, or confirmation of, my assumptions on the main question of this thread?
At first blush, I'd guess that someone else put those files there or
that the time changed on one of the systems involved.
No one else worked on this, and NTP has been running correctly on both systems. This was my first guess as well, but no.
I guess PostgreSQL just flushed them to the archive before deleting/renaming them. Does that make any sense?

Thanks!

Stephen

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)