Errors with physical replication

greigwise <greigwise@xxxxxxxxxxx> · Mon, 21 May 2018 05:18:57 -0700 (MST)

Hello.  

We are on Postgresql version 9.6.6.  We have 2 EC2 instances in different
Amazon regions and we are doing physical replication via VPN.  It all seems
to work just fine most of the time.   I'm noticing in the logs that we have
recurring erros (maybe 10 or 12 times per day) that look like this:

2018-05-17 06:36:14 UTC  5af0599f.210d  LOG:  invalid resource manager ID 49
at 384/42A4AB00
2018-05-17 06:36:14 UTC  5afd22de.7ac4  LOG:  started streaming WAL from
primary at 384/42000000 on timeline 1
2018-05-17 07:20:17 UTC  5afd22de.7ac4  FATAL:  could not receive data from
WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

Or some that also look like this:

2018-05-17 07:20:17 UTC  5af0599f.210d  LOG:  record with incorrect
prev-link 49F07120/9F100C95 at 384/45209FC0
2018-05-17 07:20:18 UTC  5afd2d31.1889  LOG:  started streaming WAL from
primary at 384/45000000 on timeline 1
2018-05-17 08:03:28 UTC  5afd2d31.1889  FATAL:  could not receive data from
WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

And some like this:

2018-05-17 23:00:13 UTC  5afd63ec.26fc  LOG:  invalid magic number 0000 in
log segment 00000001000003850000003C, offset 10436608
2018-05-17 23:00:14 UTC  5afe097d.49aa  LOG:  started streaming WAL from
primary at 385/3C000000 on timeline 1

Then, like maybe once every couple months or so, we have a crash with logs
looking like this:

2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology WARNING: 
terminating connection because of crash of another server process
2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology HINT:  In a moment
you should be able to reconnect to the database and repeat your command.
2018-05-17 08:03:28 UTC  5af0599f.210a  LOG:  database system is shut down

When this last error occurs, the recovery is to go on the replica and remove
all the WAL logs from the pg_xlog director and then restart Postgresql. 
Everything seems to recover and come up fine.  I've done some tests
comparing counts between the replica and the primary and everything seems
synced just fine from all I can tell.  

So, a couple of questions.  1) Should I be worried that my replica is
corrupt in some way or given that everything *seems* ok, is it reasonable to
believe that things are working correctly in spite of these errors being
reported.  2)  Is there something I should configure differently to avoid
some of these errors?

Thanks in advance for any help.

Greig Wise

--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html