Hi,

I'm facing a fairly new issue with our Ceph cluster: from time to time the ceph-mgr on one of our two mgr nodes gets OOM-killed after consuming over 100 GB of RAM:

[Nov21 15:02] tp_osd_tp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ +0.000010] oom_kill_process.cold+0xb/0x10
[ +0.000002] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ +0.000008] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=504d37b566d9fd442d45904a00584b4f61c93c5d49dc59eb1c948b3d1c096907,mems_allowed=0-1,global_oom,task_memcg=/docker/3826be8f9115479117ddb8b721ca57585b2bdd58a27c7ed7b38e8d83eb795957,task=ceph-mgr,pid=3941610,uid=167
[ +0.000697] Out of memory: Killed process 3941610 (ceph-mgr) total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:260356kB oom_score_adj:0
[ +6.509769] oom_reaper: reaped process 3941610 (ceph-mgr), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The cluster is stable and operating normally, and there is nothing unusual going on before, during or after the kill, so it's unclear what causes the mgr to balloon, consume all RAM and get killed. The systemd logs aren't very helpful either: they just show normal mgr operations until it fails to allocate memory and gets killed: https://pastebin.com/MLyw9iVi

The mgr has hit this issue several times over the last two months, and the occurrences don't appear to correlate with any other events in the cluster; basically nothing else happened around those times.

How can I investigate this and figure out what's causing the mgr to consume all memory and get killed? I would very much appreciate any advice!

Best regards,
Zakhar
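
P.S. The only idea I've had so far is to sample the mgr's RSS at regular intervals on the host, so that next time this happens I can at least see whether the memory grows gradually (a slow leak) or balloons in one sudden spike. Below is a rough, untested sketch of what I have in mind; the log path and the 60-second interval are placeholders I made up, not anything taken from the Ceph docs:

#!/usr/bin/env python3
# Rough sketch: periodically log the RSS of every ceph-mgr process visible
# in /proc on the host, to tell a gradual leak apart from a sudden spike.
# LOG_PATH and INTERVAL are arbitrary assumptions; adjust as needed.
import os
import time

LOG_PATH = "/var/log/ceph-mgr-rss.log"   # assumed location
INTERVAL = 60                            # seconds between samples

def mgr_rss_kib():
    """Return {pid: VmRSS in KiB} for every process whose comm is ceph-mgr."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() != "ceph-mgr":
                    continue
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        result[int(entry)] = int(line.split()[1])
                        break
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process went away or isn't readable; skip it
    return result

if __name__ == "__main__":
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(LOG_PATH, "a") as log:
            for pid, rss in mgr_rss_kib().items():
                log.write(f"{stamp} pid={pid} rss_kib={rss}\n")
        time.sleep(INTERVAL)

If there's a better way to get this kind of memory history out of the mgr itself, I'd be happy to hear about it.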