Thanks, Ninad. It looks like there is an error in 0000000100013D94000000FF. Is there any way to tell whether this is logical or physical corruption? In other words, was the file damaged at the file-system level, or did Postgres itself generate a corrupted file?
pg_waldump -q 0000000100013D94000000FE
[no errors]
pg_waldump -q 0000000100013D94000000FF
pg_waldump: fatal: error in WAL record at 13D94/FFBFFF48: invalid magic number 0000 in log segment 0000000100013D94000000FF, offset 12582912
pg_waldump -q 0000000100013D9500000000
[no errors]
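For reference, the segment name and byte offset in the pg_waldump error can be turned into an LSN, and the page at that offset can be inspected directly. The sketch below is illustrative only: it assumes the default 16 MiB segment size and the default 8 KiB WAL block size, and the helper names segment_offset_to_lsn and nonzero_bytes_in_page are made up for this example. Offset 12582912 is 0xC00000, so the failing page starts at 13D94/FFC00000, and the reported position 13D94/FFBFFF48 lies 184 bytes (0xB8) before that page boundary. A magic number of 0000 means the first two bytes of the page header read as zero, so counting the non-zero bytes in that page shows whether the whole page is zero-filled.

```sh
#!/bin/sh
# Assumptions: default 16 MiB WAL segments and default 8 KiB WAL block size.
WAL_SEG_SIZE=16777216
XLOG_BLCKSZ=8192

# segment_offset_to_lsn NAME OFFSET
# Print the LSN of byte OFFSET inside WAL segment file NAME.
segment_offset_to_lsn() {
  log_hex=$(printf '%s' "$1" | cut -c9-16)    # high 32 bits of the LSN
  seg_hex=$(printf '%s' "$1" | cut -c17-24)   # segment number within that range
  printf '%X/%X\n' "$(( 0x$log_hex ))" "$(( 0x$seg_hex * WAL_SEG_SIZE + $2 ))"
}

# nonzero_bytes_in_page FILE OFFSET
# Count the non-zero bytes in the WAL page starting at byte OFFSET.
# OFFSET is expected to be a multiple of XLOG_BLCKSZ.
nonzero_bytes_in_page() {
  dd if="$1" bs="$XLOG_BLCKSZ" skip="$(( $2 / XLOG_BLCKSZ ))" count=1 2>/dev/null |
    LC_ALL=C tr -d '\000' | wc -c | tr -d ' '
}

segment_offset_to_lsn 0000000100013D94000000FF 12582912   # prints 13D94/FFC00000

# Against the real segment file (run in the directory that holds it):
#   nonzero_bytes_in_page 0000000100013D94000000FF 12582912
```

A result of 0 from nonzero_bytes_in_page would be consistent with a zero-filled page rather than scrambled data. It does not by itself identify the cause, and comparing a checksum of this copy against another copy of the same segment (if one exists) would be a useful second check.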
From: Ninad Shah <ninad.shah@xxxxxxxxxxx>
Sent: Sunday, June 23, 2024 7:16 AM
To: Murthy Nunna <mnunna@xxxxxxxx>
Cc: pgsql-admin@xxxxxxxxxxxxxx
Subject: Re: Replication is stuck
Hi Murthy,
Would you please generate a pg_waldump of 0000000100013D94000000FF, 0000000100013D94000000FE and 0000000100013D9500000000?
Thanks,
--
On Sun, Jun 23, 2024 at 5:32 PM Murthy Nunna <mnunna@xxxxxxxx> wrote:
I am running PostgreSQL 14.4. I use WAL replication to a standby server that is kept 7 days behind the primary (recovery_min_apply_delay = 7d).
My replication is stuck. The standby appears to be restoring the same WAL file over and over, even though the next WAL files are present.
I restarted the cluster, but that did not fix the issue.
I would appreciate any help you can provide before I rebuild the standby. I am trying to find the root cause. If 0000000100013D94000000FF is corrupted, how can we tell?
2024-06-23 06:54:57 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:02 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:07 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:12 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:17 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:22 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:27 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:32 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:37 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
2024-06-23 06:55:42 CDT []LOG: restored log file "0000000100013D94000000FF" from archive
There are no missing WALs:
ls -ltr 0000000100013D95000000* |more
-rw------- 1 postgres postgres 16777216 Jun 14 19:39 0000000100013D9500000000
-rw------- 1 postgres postgres 16777216 Jun 14 19:39 0000000100013D9500000001
-rw------- 1 postgres postgres 16777216 Jun 14 19:39 0000000100013D9500000002
-rw------- 1 postgres postgres 16777216 Jun 14 19:39 0000000100013D9500000003
-rw------- 1 postgres postgres 16777216 Jun 14 19:40 0000000100013D9500000004
-rw------- 1 postgres postgres 16777216 Jun 14 19:40 0000000100013D9500000005
-rw------- 1 postgres postgres 16777216 Jun 14 19:40 0000000100013D9500000006
-rw------- 1 postgres postgres 16777216 Jun 14 19:40 0000000100013D9500000007
-rw------- 1 postgres postgres 16777216 Jun 14 19:40 0000000100013D9500000008
-rw------- 1 postgres postgres 16777216 Jun 14 19:40 0000000100013D9500000009
-rw------- 1 postgres postgres 16777216 Jun 14 19:41 0000000100013D950000000A
-rw------- 1 postgres postgres 16777216 Jun 14 19:41 0000000100013D950000000B
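The "no missing WALs" check can also be done mechanically instead of by eye. The following is a minimal sketch, assuming the default 16 MiB segment size (256 segments per high-order value) and that all files belong to a single timeline. The function name find_wal_gaps and the archive path in the usage comment are placeholders for this example.

```sh
#!/bin/sh
# Assumption: default 16 MiB segments, so 0x100000000 / 16 MiB = 256 segments
# share each value of the middle 8 hex digits of the file name.
SEGS_PER_LOG=256

# find_wal_gaps
# Read WAL segment file names (one per line) on stdin and print
# "gap after NAME" wherever the next name is not the immediate successor.
# Names that are not exactly 24 hex digits (.history, .partial, ...) are ignored.
find_wal_gaps() {
  prev_no='' prev_name=''
  grep -E '^[0-9A-F]{24}$' | LC_ALL=C sort | while read -r name; do
    log_hex=$(printf '%s' "$name" | cut -c9-16)
    seg_hex=$(printf '%s' "$name" | cut -c17-24)
    # Position of this segment in one continuous sequence.
    no=$(( 0x$log_hex * SEGS_PER_LOG + 0x$seg_hex ))
    if [ -n "$prev_no" ] && [ "$no" -ne "$(( $prev_no + 1 ))" ]; then
      echo "gap after $prev_name"
    fi
    prev_no=$no prev_name=$name
  done
}

# Usage (the path is a placeholder):
#   ls /path/to/wal_archive | find_wal_gaps
```

No output means every segment in the listing is followed by its direct successor, including across the 000000FF to 00000000 rollover between 13D94 and 13D95.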
Your WAL file is corrupted. It's not possible to restore.
Thanks,
--
On Sun, Jun 23, 2024 at 6:04 PM Murthy Nunna <mnunna@xxxxxxxx> wrote: