Hello.

At Mon, 21 May 2018 05:18:57 -0700 (MST), greigwise <greigwise@xxxxxxxxxxx> wrote in <1526905137308-0.post@xxxxxxxxxxxxx>
> Hello.
>
> We are on Postgresql version 9.6.6. We have 2 EC2 instances in
> different Amazon regions and we are doing physical replication via
> VPN. It all seems to work just fine most of the time. I'm noticing
> in the logs that we have recurring errors (maybe 10 or 12 times per
> day) that look like this:

<following is digested>

> 2018-05-17 06:36:14 UTC 5af0599f.210d LOG: invalid resource manager ID 49
> 2018-05-17 06:36:14 UTC 5afd22de.7ac4 LOG: started streaming WAL from
> 2018-05-17 07:20:17 UTC 5afd22de.7ac4 FATAL: could not receive data from
> WAL stream: server closed the connection unexpectedly
>
> Or some that also look like this:
>
> 2018-05-17 07:20:17 UTC 5af0599f.210d LOG: record with incorrect prev-link
> 2018-05-17 07:20:18 UTC 5afd2d31.1889 LOG: started streaming WAL from
> 2018-05-17 08:03:28 UTC 5afd2d31.1889 FATAL: could not receive data from
> WAL stream: server closed the connection unexpectedly
>
> And some like this:
>
> 2018-05-17 23:00:13 UTC 5afd63ec.26fc LOG: invalid magic number 0000 in
> log segment 00000001000003850000003C, offset 10436608
> 2018-05-17 23:00:14 UTC 5afe097d.49aa LOG: started streaming WAL from
> primary at 385/3C000000 on timeline 1

Your replication connection seems quite unstable and disconnects
frequently. After a disconnection you will see several kinds of "I
found a broken record in my WAL file" message; they are the cue for
the standby to switch back to streaming. This by itself is normal
PostgreSQL operation, with one known exception.

> Then, like maybe once every couple months or so, we have a crash
> with logs looking like this:
>
> 2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology WARNING:
> terminating connection because of crash of another server process

I think these lines follow an error message like "FATAL: invalid
memory alloc request size 3075129344". This is also a kind of "broken
record", but one that is known to lead the standby to crash. It is
discussed here:

[bug fix] Cascaded standby cannot start after a clean shutdown
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05#0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05

> When this last error occurs, the recovery is to go on the replica
> and remove all the WAL logs from the pg_xlog directory and then
> restart Postgresql. Everything seems to recover and come up fine.
> I've done some tests comparing counts between the replica and the
> primary and everything seems synced just fine from all I can tell.

Those are the right recovery steps, as far as I can tell from the log
messages you attached. (A sketch of the procedure follows at the end
of this mail.)

> So, a couple of questions. 1) Should I be worried that my replica
> is corrupt in some way or given that everything *seems* ok, is it
> reasonable to believe that things are working correctly in spite of
> these errors being reported. 2) Is there something I should
> configure differently to avoid some of these errors?

It doesn't seem worth worrying about from the viewpoint of data
integrity, but if the walsender/walreceiver timeouts fire too
frequently, you might need to increase them for more stability (see
the sketch below).

> Thanks in advance for any help.
>
> Greig Wise
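For reference, a minimal sketch of the settings I mean, assuming you
edit postgresql.conf directly. Both default to 60s; the 120s values
below are only illustrative, not a recommendation:

    # postgresql.conf on the primary; a reload is enough to apply it
    wal_sender_timeout = 120s      # default 60s

    # postgresql.conf on the standby
    wal_receiver_timeout = 120s    # default 60s

Longer timeouts make the connection more tolerant of a flaky VPN
link, at the cost of noticing a truly dead peer more slowly.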
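And a sketch of the recovery procedure you described, assuming the
standby's data directory is $PGDATA (the variable and paths are
placeholders; adjust them for your installation):

    # on the standby only; never do this on the primary
    pg_ctl -D "$PGDATA" stop -m fast

    # remove the WAL segment files, leaving the archive_status
    # subdirectory in place
    find "$PGDATA/pg_xlog" -maxdepth 1 -type f -delete

    # on restart the standby streams the WAL it needs from the
    # primary again (or restores it via restore_command, if set)
    pg_ctl -D "$PGDATA" start

Note that if the primary has already recycled the required segments,
the standby would need a fresh base backup instead.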
regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center