On 11/12/2018 14:48, Achilleas Mantzios wrote:
On 11/12/18 4:00 μ.μ., Chris Withers wrote:
I'm looking after a multi-tenant PG 9.4 cluster, and we've started
getting alerts for the number of WALs on the server.
It'd be great to understand what's generating all that WAL and what's
likely to be causing any problems.\
Regarding you wals in pg_wal, a good threshold could be anything more
than a e.g. 10% increase from wal_keep_segments with a trend to go up.
If this number goes up chances are something bad is happening.
wal_keep_segments is 0 here, so I went hunting in pg_replication_slots
and found that it's the barman slot that's behind on the WAL.
strace'ing on the barman receiving side shows it waiting on the socket,
so no problem on that side.
on the sending side, it's a little confusing, since the server is
basically idle, no cpu or disk activity, and yet strace'ing the sending
process attached to the other end of the socket shows time being spent
waiting on a poll which, while it includes the socket being sent to,
also includes a bunch of pipes. I've attached a chunk of the output
below in the hope that someone on this list could offer an explanation
as to what might cause the WAL to be trickling over this port rather
than catching up as fast as it can?
cheers,
Chris
strace output for the streaming_barman slot servicing the barman
replication slot:
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
write(4, "\0", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
read(3, "\0", 16) = 1
read(6, 0x7fffffffdcd7, 1) = -1 EAGAIN (Resource
temporarily unavailable)
recvfrom(10, 0x7fffffffdcaf, 1, 0, NULL, NULL) = -1 EAGAIN (Resource
temporarily unavailable)
read(5,
"\3\1\0\0\7\326\355\2\343\0\0\0@\n\0\0\0\17\1\230\313\4\0\0)\347\372l\0\0\0\0"...,
4272) = 4272
sendto(10, "d\0\0\20\315w\0\0\4\313\230\1\17P\0\0\4\313\230\1
\0\0\2\37\321\343\345\306\20\3\1"..., 4302, 0, NULL, 0) = 4302
read(3, 0x7fffffffdc90, 16) = -1 EAGAIN (Resource
temporarily unavailable)
poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}, {fd=6,
events=POLLIN}], 3, 26076) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
write(4, "\0", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
read(3, "\0", 16) = 1
read(6, 0x7fffffffdcd7, 1) = -1 EAGAIN (Resource
temporarily unavailable)
recvfrom(10, 0x7fffffffdcaf, 1, 0, NULL, NULL) = -1 EAGAIN (Resource
temporarily unavailable)
read(5, "~\320\5\0\3\0\0\0\0
\1\230\313\4\0\0\3\0\0\0\0\0\0\0\323\316\314\0\0\0\0\0"..., 8192) = 8192
sendto(10, "d\0\0 \35w\0\0\4\313\230\1
\0\0\0\4\313\230\1@\0\0\2\37\321\343\361\376\226~\320"..., 8222, 0,
NULL, 0) = 8222
read(3, 0x7fffffffdc90, 16) = -1 EAGAIN (Resource
temporarily unavailable)
poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}, {fd=6,
events=POLLIN}], 3, 25275) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
Interesting numbers:
- process 3577 is the wal writer process
- fd 10 is the socket, 3 and 6 are pipes
- time is visibly spent sitting on that poll
- the network is only seeing 2Mbit/sec, which is nothing
cheers,
Chris