Yes, we use docker, though we haven't had any issues because of it. I don't
think that docker itself can cause mgr memory leaks.

/Z

On Wed, 22 Nov 2023, 15:14 Eugen Block, <eblock@xxxxxx> wrote:

> One other difference is that you use docker, right? We use podman; could it
> be some docker restriction?
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
>> It's a 6-node cluster with 96 OSDs and not much I/O. Each node has 384 GB
>> of RAM, each OSD has a memory target of 16 GB, and about 100 GB of memory,
>> give or take, is available (mostly used by page cache) on each node during
>> normal operation. Nothing unusual there, tbh.
>>
>> No unusual mgr modules or settings either, except for the disabled
>> progress module:
>>
>> {
>>     "always_on_modules": [
>>         "balancer",
>>         "crash",
>>         "devicehealth",
>>         "orchestrator",
>>         "pg_autoscaler",
>>         "progress",
>>         "rbd_support",
>>         "status",
>>         "telemetry",
>>         "volumes"
>>     ],
>>     "enabled_modules": [
>>         "cephadm",
>>         "dashboard",
>>         "iostat",
>>         "prometheus",
>>         "restful"
>>     ],
>>
>> /Z
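For anyone wanting to reproduce this kind of inspection on their own cluster: the module list quoted above looks like the output of "ceph mgr module ls", and the OSD memory target and current mgr memory use can be checked with standard Ceph/cephadm CLI. A minimal sketch (the commands are standard, but exact output format and the availability of the orchestrator command depend on your release and deployment type):

    # Show the mgr module lists (always-on vs. explicitly enabled), as quoted above:
    ceph mgr module ls

    # Show the effective OSD memory target (expected to be ~16 GB in this cluster):
    ceph config get osd osd_memory_target

    # Show current memory use of the mgr daemons as reported by the orchestrator:
    ceph orch ps --daemon-type mgr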
>> On Wed, 22 Nov 2023, 14:52 Eugen Block, <eblock@xxxxxx> wrote:
>>
>>> What does your hardware look like memory-wise? Just for comparison,
>>> one customer cluster has 4.5 GB in use (middle-sized cluster for
>>> openstack, 280 OSDs):
>>>
>>>     PID USER     PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>    6077 ceph     20   0 6357560 4,522g  22316 S 12,00 1,797  57022:54 ceph-mgr
>>>
>>> In our own cluster (smaller than that and not really heavily used) the
>>> mgr uses almost 2 GB. So those numbers you have seem relatively small.
>>>
>>> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>>
>>>> I've disabled the progress module entirely and will see how it goes.
>>>> Otherwise, mgr memory usage keeps increasing slowly; from past experience
>>>> it will stabilize at around 1.5-1.6 GB. Other than these event warnings,
>>>> it's unclear what could have caused random memory ballooning.
>>>>
>>>> /Z
>>>>
>>>> On Wed, 22 Nov 2023 at 13:07, Eugen Block <eblock@xxxxxx> wrote:
>>>>
>>>>> I see these progress messages all the time; I don't think they cause
>>>>> it, but I might be wrong. You can disable the module just to rule that
>>>>> out.
>>>>>
>>>>> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>>>>
>>>>>> Unfortunately, I don't have a full stack trace because there's no
>>>>>> crash when the mgr gets oom-killed. There's just the mgr log, which
>>>>>> looks completely normal until about 2-3 minutes before the oom-kill,
>>>>>> when tcmalloc warnings show up.
>>>>>>
>>>>>> I'm not sure that it's the same issue that is described in the
>>>>>> tracker. We seem to have some stale "events" in the progress module
>>>>>> though:
>>>>>>
>>>>>> Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700  0 [progress WARNING root] complete: ev cacc4230-75ee-4892-b8fd-a19fec8f9f66 does not exist
>>>>>> Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700  0 [progress WARNING root] complete: ev 44824331-3f6b-45c4-b925-423d098c3c76 does not exist
>>>>>> Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700  0 [progress WARNING root] complete: ev 0139bc54-ae42-4483-b278-851d77f23f9f does not exist
>>>>>> Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700  0 [progress WARNING root] complete: ev f9d6c20e-b8d8-4625-b9cf-84da1244c822 does not exist
>>>>>> Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700  0 [progress WARNING root] complete: ev 1486b26d-2a23-4416-a864-2cbb0ecf1429 does not exist
>>>>>> Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700  0 [progress WARNING root] complete: ev 7f14d01c-498c-413f-b2ef-05521050190a does not exist
>>>>>> Nov 21 14:57:35 ceph01 bash[3941523]: debug 2023-11-21T14:57:35.950+0000 7f4bb19ef700  0 [progress WARNING root] complete: ev 48cbd97f-82f7-4b80-8086-890fff6e0824 does not exist
>>>>>>
>>>>>> I tried clearing them but they keep showing up. I am wondering if
>>>>>> these missing events can cause memory leaks over time.
>>>>>>
>>>>>> /Z
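A note on the progress module discussed above: it is an always-on module, so it cannot be removed with the regular "ceph mgr module disable" mechanism, but it exposes its own commands for turning event reporting off and for clearing recorded events. A rough sketch of the relevant CLI, assuming a reasonably recent release ("ceph progress clear" may not exist in older versions, and a mgr failover is a heavier-handed way to drop the module's in-memory state):

    # Turn progress event reporting off (or back on):
    ceph progress off
    ceph progress on

    # Clear all recorded progress events, presumably what "clearing them"
    # above refers to:
    ceph progress clear

    # Failing over to the standby mgr also discards the active mgr's
    # in-memory state:
    ceph mgr fail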
>>>>>> On Wed, 22 Nov 2023 at 11:12, Eugen Block <eblock@xxxxxx> wrote:
>>>>>>
>>>>>>> Do you have the full stack trace? The pastebin only contains the
>>>>>>> "tcmalloc: large alloc" messages (same as in the tracker issue). Maybe
>>>>>>> comment in the tracker issue directly, since Radek asked for someone
>>>>>>> with a similar problem in a newer release.
>>>>>>>
>>>>>>> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>>>>>>
>>>>>>>> Thanks, Eugen. It is similar in the sense that the mgr is getting
>>>>>>>> OOM-killed.
>>>>>>>>
>>>>>>>> It started happening in our cluster after the upgrade to 16.2.14. We
>>>>>>>> haven't had this issue with earlier Pacific releases.
>>>>>>>>
>>>>>>>> /Z
>>>>>>>>
>>>>>>>> On Tue, 21 Nov 2023, 21:53 Eugen Block, <eblock@xxxxxx> wrote:
>>>>>>>>
>>>>>>>>> Just checking it on the phone, but isn’t this quite similar?
>>>>>>>>>
>>>>>>>>> https://tracker.ceph.com/issues/45136
>>>>>>>>>
>>>>>>>>> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm facing a rather new issue with our Ceph cluster: from time to
>>>>>>>>>> time, ceph-mgr on one of the two mgr nodes gets oom-killed after
>>>>>>>>>> consuming over 100 GB of RAM:
>>>>>>>>>>
>>>>>>>>>> [Nov21 15:02] tp_osd_tp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
>>>>>>>>>> [  +0.000010]  oom_kill_process.cold+0xb/0x10
>>>>>>>>>> [  +0.000002] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
>>>>>>>>>> [  +0.000008] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=504d37b566d9fd442d45904a00584b4f61c93c5d49dc59eb1c948b3d1c096907,mems_allowed=0-1,global_oom,task_memcg=/docker/3826be8f9115479117ddb8b721ca57585b2bdd58a27c7ed7b38e8d83eb795957,task=ceph-mgr,pid=3941610,uid=167
>>>>>>>>>> [  +0.000697] Out of memory: Killed process 3941610 (ceph-mgr) total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:260356kB oom_score_adj:0
>>>>>>>>>> [  +6.509769] oom_reaper: reaped process 3941610 (ceph-mgr), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>>>>>>>>>>
>>>>>>>>>> The cluster is stable and operating normally; there's nothing
>>>>>>>>>> unusual going on before, during or after the kill, so it's unclear
>>>>>>>>>> what causes the mgr to balloon, use all RAM and get killed. Systemd
>>>>>>>>>> logs aren't very helpful: they just show normal mgr operations
>>>>>>>>>> until it fails to allocate memory and gets killed:
>>>>>>>>>> https://pastebin.com/MLyw9iVi
>>>>>>>>>>
>>>>>>>>>> The mgr experienced this issue several times in the last 2 months,
>>>>>>>>>> and the events don't appear to correlate with any other events in
>>>>>>>>>> the cluster, because basically nothing else happened at around
>>>>>>>>>> those times. How can I investigate this and figure out what's
>>>>>>>>>> causing the mgr to consume all memory and get killed?
>>>>>>>>>>
>>>>>>>>>> I would very much appreciate any advice!
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Zakhar
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
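As a general starting point for the question asked at the bottom of this thread (how to investigate a ceph-mgr that balloons and gets OOM-killed): correlate the kernel's OOM record with periodic snapshots of the mgr's resident memory, so that growth over hours or days can be lined up against cluster events and mgr log messages. A minimal sketch, assuming a systemd-based host, that the containerized ceph-mgr process is visible to the host's ps (the default for docker/podman), and an arbitrarily chosen log path:

    # Kernel-side evidence of the kill (task, cgroup, and RSS at kill time):
    journalctl -k | grep -i -E 'oom|out of memory'

    # Periodically record the mgr's resident set size:
    while true; do
        date
        ps -o pid,rss,vsz,cmd -C ceph-mgr
        sleep 300
    done >> /var/log/ceph-mgr-rss.log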