Re: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284

Adrian Klaver <adrian.klaver@xxxxxxxxxxx> · Sat, 25 Jan 2025 09:32:27 -0800

On 1/25/25 09:03, Дмитрий wrote:
1) What sort of replication?
- Streaming replication

2) Where are the two servers located relative to each other?
- The servers are located in different data centers.

3) Has there been any software upgrades/network changes recently?
- I don't know any information about the  software upgrades/network

It would be a good thing to ask of those folks that do know.

From the log attached to your initial post:

2025-01-25 17:28:01.930 MSK [1196013] LOG:  starting PostgreSQL 15.10 on 
x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 
11.5.0-2), 64-bit
2025-01-25 17:28:01.930 MSK [1196013] LOG:  listening on IPv4 address 
"0.0.0.0", port 5432
2025-01-25 17:28:01.931 MSK [1196013] LOG:  listening on Unix socket 
"/run/postgresql/.s.PGSQL.5432"
2025-01-25 17:28:01.932 MSK [1196013] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"
2025-01-25 17:28:01.962 MSK [1196017] LOG:  database system was shut 
down in recovery at 2025-01-25 17:28:01 MSK
2025-01-25 17:28:01.962 MSK [1196017] LOG:  entering standby mode

How was it shut down, on purpose or a hardware/software issue?

Also do you have corresponding logs from primary?

2025-01-25 17:28:12.192 MSK [1196017] LOG:  consistent recovery state 
reached at 1063C/D002DC68
2025-01-25 17:28:12.192 MSK [1196017] LOG:  incorrect resource manager 
data checksum in record at 1063C/D002DC68
2025-01-25 17:28:12.192 MSK [1196013] LOG:  database system is ready to 
accept read-only connections
2025-01-25 17:28:12.205 MSK [1196019] LOG:  started streaming WAL from 
primary at 1063C/D0000000 on timeline 61

The recovery ended and the streaming started.

Not sure if 'incorrect resource manager data checksum' is significant or 
not.

2025-01-25 17:29:08.452 MSK [1196015] LOG:  recovery restart point at 
1063C/DBC7E1D8
2025-01-25 17:29:08.452 MSK [1196015] DETAIL:  Last completed 
transaction was at log time 2025-01-25 16:23:08.828548+03.
2025-01-25 17:29:24.553 MSK [1196015] LOG:  restartpoint starting: wal
2025-01-25 17:29:24.553 MSK [1196015] DEBUG:  performing replication 
slot checkpoint
2025-01-25 17:29:27.651 MSK [1196019] FATAL:  could not send data to WAL 
stream: lost synchronization with server: got message type "0", length 
892351284
2025-01-25 17:29:27.653 MSK [1196017] LOG:  invalid magic number 3600 in 
log segment 0000003D0001063D000000F4, offset 212992
2025-01-25 17:29:27.653 MSK [1196017] LOG:  invalid magic number 3600 in 
log segment 0000003D0001063D000000F4, offset 212992
2025-01-25 17:29:27.653 MSK [1196017] LOG:  invalid magic number 3600 in 
log segment 0000003D0001063D000000F4, offset 212992

This is where things fall apart. What confuses me is:

"could not send data to WAL stream: lost synchronization with server: 
got message type "0", length 892351284"

If this is from the standby why is it sending data to the stream?

Unless, is there cascading replication going on?

2025-01-25 17:30:01.887 MSK [1196013] LOG:  received fast shutdown request
2025-01-25 17:30:01.888 MSK [1196013] LOG:  aborting any active transactions

Was that a manual intervention?

2025-01-25 17:30:02.157 MSK [1196015] LOG:  shutting down
2025-01-25 17:30:02.181 MSK [1196013] LOG:  database system is shut down
2025-01-25 17:30:02.182 MSK [1196014] DEBUG:  logger shutting down

So the server went from start up to shut down in ~2 minutes.

From your original post:

'Restarting PostgreSQL helps.'

Is that what is shown above or have you restarted since the above and 
the server is running?

--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx