Search Postgresql Archives

walreceiver termination

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



We've seen unexpected termination of the WAL receiver process.  This
stops streaming replication, but the replica stays available --
restarting the server resumes streaming replication where it left off.
We've seen this across nearly every recent version of PG, (9.4, 9.5,
11.x, 12.x) -- anything omitted is one we haven't used.

I don't have an explanation for the cause, but I was able to set
logging to "debug5" and run an strace of the walrecevier PID when it
eventually happened.  It appears as if the SIGTERM is coming from the
"postgres: startup" process.  The only notable thing that I've seen
discussion of (in the past) and where it might be relevant here is
that these replicas all reside on a ZFS filesystem (on Linux).  In
fact, everywhere we've seen issues, the database sits on a ZFS pool.
We also have another PG cluster running on non-ZFS (ext4) and have
never experienced this problem.

Any suggestions?

Logging snippets are below:

postgresql.log
Apr 23 09:09:13 royce postgres[2229]: [43141904-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG:  record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141904-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT:  WAL redo at 2D6/C259A6E0 for
Btree/INSERT_LEAF: off 88
Apr 23 09:09:13 royce postgres[2229]: [43141905-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG:  record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141905-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT:  WAL redo at 2D6/C259A720 for
Heap/HEAP_CONFIRM: off 6
Apr 23 09:09:13 royce postgres[2229]: [43141906-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG:  record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141906-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT:  WAL redo at 2D6/C259A750 for
Heap/INSERT: off 7 flags 0x0C
Apr 23 09:09:13 royce postgres[2229]: [43141907-1] 2020-04-23
09:09:13.663 GMT [2229] LOG:  incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448555-1] 2020-04-23
09:09:13.663 GMT [2299] FATAL:  terminating walreceiver process due to
administrator command
Apr 23 09:09:13 royce postgres[2299]: [1448556-1] 2020-04-23
09:09:13.663 GMT [2299] DEBUG:  shmem_exit(1): 1 before_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2299]: [1448557-1] 2020-04-23
09:09:13.663 GMT [2299] DEBUG:  shmem_exit(1): 6 on_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141908-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  switched WAL source from stream to
archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141909-1] 2020-04-23
09:09:13.664 GMT [2229] LOG:  incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448558-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG:  proc_exit(1): 2 callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141910-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  switched WAL source from archive to
stream after failure
Apr 23 09:09:13 royce postgres[2229]: [43141911-1] 2020-04-23
09:09:13.664 GMT [2229] LOG:  incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448559-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG:  exit(1)
Apr 23 09:09:13 royce postgres[2229]: [43141912-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  switched WAL source from stream to
archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141913-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  incorrect resource manager data
checksum in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2229]: [43141914-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  switched WAL source from archive to
stream after failure
Apr 23 09:09:13 royce postgres[2299]: [1448560-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG:  shmem_exit(-1): 0 before_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141915-1] 2020-04-23
09:09:13.664 GMT [2229] LOG:  incorrect resource manager data checksum
in record at 2D

process id list:
2224 ?        Ss     0:24 /usr/lib/postgresql/12/bin/postgres -D
/var/lib/postgresql/12/main -c
unix_socket_directories=/var/run/postgresql -c
config_file=/etc/postgresql/12/main/postgresql.conf -c
hba_file=/etc/postgresql/12/main/pg_hba.conf -c
ident_file=/etc/postgresql/12/main/pg_ident.conf -c
external_pid_file=/var/run/postgresql/12-main.pid
 2229 ?        Ss     7:02 postgres: startup
 2230 ?        Ss     6:09 postgres: checkpointer
 2231 ?        Ss     0:08 postgres: background writer
 2298 ?        Ss     1:29 postgres: stats collector
 2299 ?        Ss     6:11 postgres: walreceiver

strace of walreceiver:
kill(2229, SIGUSR1)                     = 0
getpid()                                = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448554-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG:  sending write
2D6/C25A0000 flush 2D6/C25A0000 apply 2D6/C25915F0", 156,
MSG_NOSIGNAL, NULL, 0) = 156
sendto(4<socket:[208548975]>,
"\x64\x00\x00\x00\x26\x72\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x59\x15\xf0\x00\x02\x46\xf0\x5f\x04\xbe\x9e\x00",
39, MSG_NOSIGNAL, NULL, 0) = 39
epoll_create1(EPOLL_CLOEXEC)            = 6<anon_inode:[eventpoll]>
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
10<pipe:[208548970]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640776,
u64=94863290307528}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
7<pipe:[208548681]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640800,
u64=94863290307552}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
4<socket:[208548975]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640824,
u64=94863290307576}}) = 0
epoll_wait(6<anon_inode:[eventpoll]>, 0x564714b89410, 1, 100) = -1
EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=2229, si_uid=112} ---
write(11<pipe:[208548970]>, "\x00", 1)  = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
close(6<anon_inode:[eventpoll]>)        = 0
getpid()                                = 2299
sendto(3<socket:[208548671]>, "<131>Apr 23 09:09:13 postgres[2299]:
[1448555-1] 2020-04-23 09:09:13.663 GMT [2299] FATAL:  terminating
walreceiver process due to administrator command", 152, MSG_NOSIGNAL,
NULL, 0) = 152
getpid()                                = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448556-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG:  shmem_exit(1):
1 before_shmem_exit callbacks to make", 144, MSG_NOSIGNAL, NULL, 0) =
144
getpid()                                = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448557-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG:  shmem_exit(1):
6 on_shmem_exit callbacks to make", 140, MSG_NOSIGNAL, NULL, 0) = 140
sendto(4<socket:[208548975]>, "\x58\x00\x00\x00\x04", 5, MSG_NOSIGNAL,
NULL, 0) = 5


Any suggestions?

Thanks-
Justin





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux