Pg stuck at 100% cpu, for multiple days

hubert depesz lubaczewski <depesz@xxxxxxxxxx> · Mon, 30 Aug 2021 17:15:27 +0200

Hi,

Originally I posted it on -general, but Joe Conway suggested I repost in
here for greater visibility...

We hit a problem with Pg 12.6 (I know, we should upgrade, but that will
take long time to prepare).

Anyway - it's 12.6 on aarm64.
Couple of days there was replication slot started, and now it seems to
be stuck.

=# select * from pg_stat_activity  where pid = 22697 \gx
─[ RECORD 1 ]────┬──────────────────────────────────────────────────────────
datid            │ 16591
datname          │ canvas
pid              │ 22697
usesysid         │ 16505
usename          │ <CENSORED>
application_name │ PostgreSQL JDBC Driver
client_addr      │ <CENSORED>
client_hostname  │ [null]
client_port      │ 43160
backend_start    │ 2021-08-18 02:12:05.758297+00
xact_start       │ [null]
query_start      │ 2021-08-18 02:12:05.772271+00
state_change     │ 2021-08-18 02:12:05.773428+00
wait_event_type  │ [null]
wait_event       │ [null]
state            │ active
backend_xid      │ [null]
backend_xmin     │ [null]
query            │ SELECT COUNT(1) FROM pg_publication WHERE pubname = 'cdc'
backend_type     │ walsender

=# select pg_current_wal_lsn(), pg_size_pretty( pg_current_wal_lsn() - sent_lsn), * from pg_stat_replication where pid = 22697 \gx
─[ RECORD 1 ]──────┬──────────────────────────────
pg_current_wal_lsn │ 1B14/718EA0B8
pg_size_pretty     │ 290 GB
pid                │ 22697
usesysid           │ 16505
usename            │ <CENSORED>
application_name   │ PostgreSQL JDBC Driver
client_addr        │ <CENSORED>
client_hostname    │ [null]
client_port        │ 43160
backend_start      │ 2021-08-18 02:12:05.758297+00
backend_xmin       │ [null]
state              │ streaming
sent_lsn           │ 1ACC/D8689A8
write_lsn          │ 1ACC/D527BD8
flush_lsn          │ 1ACC/C97DF48
replay_lsn         │ 1ACC/C97DF48
write_lag          │ 00:00:00.257041
flush_lag          │ 00:00:01.26665
replay_lag         │ 00:00:01.26665
sync_priority      │ 0
sync_state         │ async
reply_time         │ 1999-12-21 03:15:13.449225+00

top shows the process using 100% of cpu. I tried strace'ing, but strace doesn't
show *anything* - it just sits there.

Got backtrace:

~# gdb --pid=22697 --batch -ex backtrace
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
1448    ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0  hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
#1  0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized out>, relid=105496194) at ./build/../src/backend/utils/cache/relfilenodemap.c:64
#2  0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) at ./build/../src/backend/utils/cache/inval.c:595
#3  0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, txn=0xaaaac326b998 <formatted_start_time>, txn=0xaaaac326b998 <formatted_start_time>) at ./build/../src/backend/replication/logical/reorderbuffer.c:2149
#4  ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, commit_lsn=187650393290540, end_lsn=<optimized out>, commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1770
#5  0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, buf=0xffffdd90f960, ctx=0xaaaaf5d396a0) at ./build/../src/backend/replication/logical/decode.c:640
#6  DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, buf@entry=0xffffdd90f9c0) at ./build/../src/backend/replication/logical/decode.c:248
#7  0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, record=0xaaaaf5d39938) at ./build/../src/backend/replication/logical/decode.c:117
#8  0x0000aaaac2ecfdfc in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2840
#9  0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2189
#10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at ./build/../src/backend/replication/walsender.c:1133
#11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc LOGICAL 1A2D/4B3640 (\"proto_version\" '1', \"publication_names\" 'cdc')") at ./build/../src/backend/replication/walsender.c:1549
#12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4257
#13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4484
#14 BackendStartup (port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4167
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at ./build/../src/backend/main/main.c:228

The thing is - I can't close it with pg_terminate_backend(), and I'd
rather not kill -9, as it will, I think, close all other connections,
and this is prod server.

The other end of the connection was something in kubernetes, and it no
longer exists.

Is there anything we could do about it?

Best regards,

depesz