On 1/26/25 03:29, Дмитрий wrote:
"How was it shut down, on purpose or a hardware/software issue?"
- I reboot the receiver every 2 minutes on purpose. I determined this
interval empirically, because replication breaks down approximately
every minute and a half. The reboot lets the receiver advance again.
[A sketch of such a schedule follows these answers.]
"Also do you have corresponding logs from primary?"
- Attached to this message.
"Unless, is there cascading replication going on?"
- No, this is replication from the leader. The leader has two replicas,
all in the same data center, and the problematic replica is needed for
a migration to another data center.
"Was that a manual intervention?"
- Yes, reboot on schedule, every two minutes.
"Is that what is shown above or have you restarted since the above and
the server is running?"
- Sometimes replication works without problems for several hours. But
when a breakdown occurs, rebooting every two minutes helps this replica
catch up.
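
On the scheduled reboot mentioned above: a minimal hypothetical sketch
of such a schedule using cron, assuming a systemd-managed standby whose
unit is named postgresql.service (the unit name and the use of
systemctl restart are assumptions, not details from the report):

    # hypothetical /etc/cron.d entry: restart the standby every 2 minutes
    */2 * * * * root systemctl restart postgresql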
1) It would make life easier if the log_line_prefix timestamp was set
to the same precision on the primary and the standby. As of now it
looks like the primary has %t (time stamp without milliseconds) and the
standby has %m (time stamp with milliseconds).
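
A minimal sketch of that change in the primary's postgresql.conf; the
full prefix string is reconstructed from the log excerpts below, so
treat it as an assumption:

    # before (seconds precision):
    # log_line_prefix = '%t [%p]: [%l-1] app=%a,user=%u,db=%d,client=%h '
    # after (millisecond precision, matching the standby):
    log_line_prefix = '%m [%p]: [%l-1] app=%a,user=%u,db=%d,client=%h '

The setting can be applied with a reload, e.g. SELECT pg_reload_conf();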
2) From the logs.
Primary:
2025-01-26 12:21:27 MSK [656]: [11-1]
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 STATEMENT:
START_REPLICATION SLOT "slot_migration_to_rcod" 106B6/52000000 TIMELINE 61
2025-01-26 12:21:27 MSK [656]: [12-1]
app=v-host-n1,user=replicator,db=[unknown],client=192.168.5.1 LOG:
disconnection: session time: 0:01:05.329 user=replicator database=
host=192.168.5.1 port=58380
Standby:
2025-01-26 12:21:27.113 MSK [10824] FATAL: could not send data to WAL
stream: lost synchronization with server: got message type "0", length
825373235
Do you know what is issuing the START_REPLICATION SLOT command?
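
One way to check is to look at the walsenders on the primary; a minimal
sketch using the standard pg_stat_replication and pg_replication_slots
views (the slot name is taken from the log above):

    -- walsender processes and the clients connected to them
    SELECT pid, application_name, client_addr, state, sent_lsn, replay_lsn
    FROM pg_stat_replication;

    -- which backend, if any, currently holds the slot from the log
    SELECT slot_name, active, active_pid
    FROM pg_replication_slots
    WHERE slot_name = 'slot_migration_to_rcod';

Matching application_name and client_addr against the log prefix
(app=v-host-n1, client=192.168.5.1) should identify the process.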
Another interesting point: in addition to this replication there are
two more, to the same data center. One of them had the same problem,
but a one-time restart solved it, and that replication is still working
normally. The other has no such problems; it has been working since its
launch, more than a month ago.
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx