Hi,

I'm facing a fairly new issue with our Ceph cluster: from time to time the ceph-mgr on one of our two mgr nodes gets OOM-killed after consuming over 100 GB of RAM:

[Nov21 15:02] tp_osd_tp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ +0.000010] oom_kill_process.cold+0xb/0x10
[ +0.000002] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ +0.000008] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=504d37b566d9fd442d45904a00584b4f61c93c5d49dc59eb1c948b3d1c096907,mems_allowed=0-1,global_oom,task_memcg=/docker/3826be8f9115479117ddb8b721ca57585b2bdd58a27c7ed7b38e8d83eb795957,task=ceph-mgr,pid=3941610,uid=167
[ +0.000697] Out of memory: Killed process 3941610 (ceph-mgr) total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:260356kB oom_score_adj:0
[ +6.509769] oom_reaper: reaped process 3941610 (ceph-mgr), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The cluster is stable and operating normally, and there is nothing unusual going on before, during or after the kill, so it's unclear what causes the mgr to balloon, consume all RAM and get killed. The systemd logs aren't very helpful either: they just show normal mgr operations until it fails to allocate memory and gets killed: https://pastebin.com/MLyw9iVi

The mgr has hit this issue several times over the last two months, and the occurrences don't appear to correlate with any other events in the cluster; basically nothing else happened around those times.

How can I investigate this and figure out what's causing the mgr to consume all memory and get killed? I would very much appreciate any advice!

Best regards,
Zakhar
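
P.S. The only idea I've had so far is to sample the mgr's RSS at regular intervals on the host, so that next time this happens I can at least see whether the memory grows gradually (a slow leak) or balloons in one sudden spike. Below is a rough, untested sketch of what I have in mind; the log path and the 60-second interval are placeholders I made up, not anything taken from the Ceph docs:

#!/usr/bin/env python3
# Rough sketch: periodically log the RSS of every ceph-mgr process visible
# in /proc on the host, to tell a gradual leak apart from a sudden spike.
# LOG_PATH and INTERVAL are arbitrary assumptions; adjust as needed.
import os
import time

LOG_PATH = "/var/log/ceph-mgr-rss.log"   # assumed location
INTERVAL = 60                            # seconds between samples

def mgr_rss_kib():
    """Return {pid: VmRSS in KiB} for every process whose comm is ceph-mgr."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() != "ceph-mgr":
                    continue
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        result[int(entry)] = int(line.split()[1])
                        break
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process went away or isn't readable; skip it
    return result

if __name__ == "__main__":
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(LOG_PATH, "a") as log:
            for pid, rss in mgr_rss_kib().items():
                log.write(f"{stamp} pid={pid} rss_kib={rss}\n")
        time.sleep(INTERVAL)

If there's a better way to get this kind of memory history out of the mgr itself, I'd be happy to hear about it.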