I encountered mgr ballooning multiple times with Luminous, but have not since. At the time, I could often get relief by sending the admin socket a heap release - heap stats would show large amounts of memory unused but not yet released. That experience is one reason I recently got Rook to allow provisioning more than two mgrs.

> On Nov 21, 2023, at 14:52, Eugen Block <eblock@xxxxxx> wrote:
>
> Just checking it on the phone, but isn't this quite similar?
>
> https://tracker.ceph.com/issues/45136
>
> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
>> Hi,
>>
>> I'm facing a rather new issue with our Ceph cluster: from time to time
>> ceph-mgr on one of the two mgr nodes gets oom-killed after consuming over
>> 100 GB of RAM:
>>
>> [Nov21 15:02] tp_osd_tp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
>> [ +0.000010] oom_kill_process.cold+0xb/0x10
>> [ +0.000002] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
>> [ +0.000008] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=504d37b566d9fd442d45904a00584b4f61c93c5d49dc59eb1c948b3d1c096907,mems_allowed=0-1,global_oom,task_memcg=/docker/3826be8f9115479117ddb8b721ca57585b2bdd58a27c7ed7b38e8d83eb795957,task=ceph-mgr,pid=3941610,uid=167
>> [ +0.000697] Out of memory: Killed process 3941610 (ceph-mgr) total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:260356kB oom_score_adj:0
>> [ +6.509769] oom_reaper: reaped process 3941610 (ceph-mgr), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>>
>> The cluster is stable and operating normally; there is nothing unusual going
>> on before, during or after the kill, so it's unclear what causes the mgr to
>> balloon, use all RAM and get killed. The systemd logs aren't very helpful:
>> they just show normal mgr operation until the process fails to allocate
>> memory and gets killed: https://pastebin.com/MLyw9iVi
>>
>> The mgr has experienced this issue several times in the last 2 months, and
>> the events don't appear to correlate with any other events in the cluster -
>> basically nothing else happened at around those times. How can I investigate
>> this and figure out what's causing the mgr to consume all memory and get
>> killed?
>>
>> I would very much appreciate any advice!
>>
>> Best regards,
>> Zakhar
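
For anyone who wants to try the heap release mentioned at the top of the thread: the commands go to the daemon's admin socket and only help when the daemon is built with tcmalloc (the default for upstream packages). A minimal sketch - the mgr name below is a hypothetical placeholder, and in a containerized deployment like the one in the OOM log you would run this from inside the mgr container or a cephadm shell on that host:

  # Show tcmalloc heap stats, including how much sits in the page heap
  # freelist (held by the allocator but not in use). Mgr name is hypothetical.
  ceph daemon mgr.ceph01.abcdef heap stats

  # Ask tcmalloc to return free pages to the OS
  ceph daemon mgr.ceph01.abcdef heap release

If heap stats shows a large page heap freelist that shrinks after the release, the memory was held-but-unused rather than leaked; if RSS keeps climbing regardless, the memory is genuinely in use and a heap release won't help.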