Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

On 6/11/18 4:58 PM, Tom Lane wrote:
Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> writes:
Remember: the PostgreSQL checkpointer decided to remove 5000+ files before shutdown. Any conditions that were keeping those files around should also have held at that point, right?
The question is why PostgreSQL didn't remove them earlier.
WAL files get removed/recycled after completion of a checkpoint.  So
apparently, checkpoints were not finishing during normal operation,
but the shutdown checkpoint managed to terminate normally.  That
eliminates a lot of the usual theories about why checkpoints might
not be succeeding (like a dirty buffer that always fails to be
written, say as a result of broken permissions on its file).

The only theory that comes to mind is that the checkpointer process
was stuck somehow, but just "soft" stuck, in a way that allowed the
postmaster's time-to-shut-down-please signal to unstick it.  No,
I have no idea how that could happen exactly.  If it happens again,
it'd be really interesting to attach to the checkpointer with a
debugger and collect a stack trace.

Hello Tom,
I enabled debugging as you suggested.
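(For the record, the plan for next time is to find the checkpointer's PID with something like the sketch below and then attach gdb to it for a backtrace; this relies on the backend_type column of pg_stat_activity, which is new in 10.x:)

-- locate the checkpointer so a debugger can be attached to it (e.g. gdb -p <pid>, then bt)
SELECT pid, backend_type, backend_start
FROM pg_stat_activity
WHERE backend_type = 'checkpointer';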
In the meantime, yesterday (also a Sunday) we had an incident with identical symptoms, except that in this (yesterday's) case I have a full explanation; it is reported here: https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com
What happened is that the walsender died on Friday, the replication slot got stuck, the server kept retaining WAL files, the pg_wal disk filled up (on Sunday morning), and the system PANIC'ed:

10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:  terminating walsender process due to replication timeout
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT:  slot "data_for_testsmadb_pub", output plugin "pgoutput", in the change callback, associated LSN 13DF/393BF7F0
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.066 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:  disconnection: session time: 49:47:17.937 user=repmgr database=dynacom host=10.9.0.77 port=48650

(so this problem appeared 2 days after the walsender died)
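(For reference, a sketch of the kind of query that shows how much WAL a replication slot is forcing the server to retain; what threshold to alert on is of course site-specific:)

-- per slot: is it active, and how much WAL is it holding back?
SELECT slot_name,
       slot_type,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;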

So I tried to find evidence that would connect the two incidents (this Sunday's and the previous Sunday's), but could not find anything in either the publisher (primary) or the subscriber logs showing a WAL sender or a logical replication worker terminating or exiting. *But* I did find this in the primary's log, back from 2018-11-01:

10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.424 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG:  could not send data to client: Connection reset by peer
10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.424 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT:  slot "data_for_testsmadb_pub", output plugin "pgoutput", in the commit callback, associated LSN 1290/6ADB518
10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.434 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG:  disconnection: session time: 29:50:03.592 user=repmgr database=dynacom host=10.9.0.77 port=47489
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.476 EET [unknown] [unknown]@[unknown] line:1 LOG:  connection received: host=10.9.0.77 port=47612
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.477 EET [unknown] repmgr@dynacom line:2 LOG:  replication connection authorized: user=repmgr
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.478 EET data_for_testsmadb_pub repmgr@dynacom line:3 LOG:  received replication command: IDENTIFY_SYSTEM
......
The above pattern repeats 9 more times.
(so this problem appeared 3 days after this connection reset)
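(Again just a sketch: a quick check of whether a walsender for the slot is still connected and advancing after such a connection reset, using the lag columns added in 10.x:)

-- is a walsender connected for the subscriber, and is it making progress?
SELECT pid,
       application_name,
       state,
       sent_lsn,
       flush_lsn,
       replay_lag
FROM pg_stat_replication;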

In this case the WAL sender did not terminate, but I wonder whether it caused this "soft" stuck state in the replication slot.
On this primary server we have only one replication slot (the one for the logical subscriber node), and archiving has worked without problems for years, so I cannot think of another reason why WAL files were accumulating in pg_wal.
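(To double-check that archiving is not the culprit, something like this sketch against pg_stat_archiver should do:)

-- confirm that WAL archiving keeps succeeding and is not piling up failures
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;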





--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt




