What does your hardware look like memory-wise? Just for comparison,
on one customer cluster the mgr has 4.5 GB in use (mid-sized cluster
for OpenStack, 280 OSDs):
  PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
 6077 ceph  20  0 6357560 4,522g 22316 S 12,00 1,797  57022:54 ceph-mgr
In our own cluster (smaller than that and not really heavily used) the
mgr uses almost 2 GB. So those numbers you have seem relatively small.
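
To compare apples to apples, something like the following should show
the current mgr footprint on your side (the orch variant assumes a
cephadm/containerized deployment, which your logs suggest):

  # per-daemon memory usage as cephadm sees it
  ceph orch ps --daemon-type mgr

  # raw RSS of the ceph-mgr process on the host
  top -b -n 1 -p "$(pgrep -f ceph-mgr | head -n 1)"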
Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> I've disabled the progress module entirely and will see how it goes.
> Otherwise, mgr memory usage keeps increasing slowly; from past
> experience it will stabilize at around 1.5-1.6 GB. Other than this
> event warning, it's unclear what could have caused random memory
> ballooning.
>
> /Z
>
> On Wed, 22 Nov 2023 at 13:07, Eugen Block <eblock@xxxxxx> wrote:
>
>> I see these progress messages all the time; I don't think they cause
>> it, but I might be wrong. You can disable it just to rule that out.
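>>
>> If you want to try that, the progress module has its own toggle and
>> a clear command; from memory it should be something along these
>> lines (please double-check the exact syntax on your release):
>>
>>   # stop the progress module from tracking/reporting events
>>   ceph progress off
>>
>>   # drop any existing progress events
>>   ceph progress clear
>>
>>   # re-enable it later
>>   ceph progress on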
>>
>> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>
>> > Unfortunately, I don't have a full stack trace because there's no
>> > crash when the mgr gets oom-killed. There's just the mgr log,
>> > which looks completely normal until about 2-3 minutes before the
>> > oom-kill, when tcmalloc warnings show up.
>> >
>> > I'm not sure that it's the same issue that is described in the
>> > tracker. We seem to have some stale "events" in the progress
>> > module though:
>> >
>> > Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700 0 [progress WARNING root] complete: ev cacc4230-75ee-4892-b8fd-a19fec8f9f66 does not exist
>> > Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700 0 [progress WARNING root] complete: ev 44824331-3f6b-45c4-b925-423d098c3c76 does not exist
>> > Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700 0 [progress WARNING root] complete: ev 0139bc54-ae42-4483-b278-851d77f23f9f does not exist
>> > Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700 0 [progress WARNING root] complete: ev f9d6c20e-b8d8-4625-b9cf-84da1244c822 does not exist
>> > Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700 0 [progress WARNING root] complete: ev 1486b26d-2a23-4416-a864-2cbb0ecf1429 does not exist
>> > Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000 7f4bb19ef700 0 [progress WARNING root] complete: ev 7f14d01c-498c-413f-b2ef-05521050190a does not exist
>> > Nov 21 14:57:35 ceph01 bash[3941523]: debug 2023-11-21T14:57:35.950+0000 7f4bb19ef700 0 [progress WARNING root] complete: ev 48cbd97f-82f7-4b80-8086-890fff6e0824 does not exist
>> >
>> > I tried clearing them but they keep showing up. I am wondering if
>> > these missing events can cause memory leaks over time.
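>> >
>> > For reference, the module's own view of the events can be dumped
>> > with something like this (assuming the json subcommand behaves the
>> > same on Pacific):
>> >
>> >   ceph progress         # human-readable summary of ongoing events
>> >   ceph progress json    # full dump, including completed events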
>> >
>> > /Z
>> >
>> > On Wed, 22 Nov 2023 at 11:12, Eugen Block <eblock@xxxxxx> wrote:
>> >
>> >> Do you have the full stack trace? The pastebin only contains the
>> >> "tcmalloc: large alloc" messages (same as in the tracker issue).
>> >> Maybe comment in the tracker issue directly since Radek asked for
>> >> someone with a similar problem in a newer release.
>> >>
>> >> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>> >>
>> >> > Thanks, Eugen. It is similar in the sense that the mgr is
>> >> > getting OOM-killed.
>> >> >
>> >> > It started happening in our cluster after the upgrade to
>> >> > 16.2.14. We haven't had this issue with earlier Pacific
>> >> > releases.
>> >> >
>> >> > /Z
>> >> >
>> >> > On Tue, 21 Nov 2023, 21:53 Eugen Block, <eblock@xxxxxx> wrote:
>> >> >
>> >> >> Just checking it on the phone, but isn’t this quite similar?
>> >> >>
>> >> >> https://tracker.ceph.com/issues/45136
>> >> >>
>> >> >> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > I'm facing a rather new issue with our Ceph cluster: from
>> >> >> > time to time ceph-mgr on one of the two mgr nodes gets
>> >> >> > oom-killed after consuming over 100 GB RAM:
>> >> >> >
>> >> >> > [Nov21 15:02] tp_osd_tp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
>> >> >> > [ +0.000010] oom_kill_process.cold+0xb/0x10
>> >> >> > [ +0.000002] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
>> >> >> > [ +0.000008] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=504d37b566d9fd442d45904a00584b4f61c93c5d49dc59eb1c948b3d1c096907,mems_allowed=0-1,global_oom,task_memcg=/docker/3826be8f9115479117ddb8b721ca57585b2bdd58a27c7ed7b38e8d83eb795957,task=ceph-mgr,pid=3941610,uid=167
>> >> >> > [ +0.000697] Out of memory: Killed process 3941610 (ceph-mgr) total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:260356kB oom_score_adj:0
>> >> >> > [ +6.509769] oom_reaper: reaped process 3941610 (ceph-mgr), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
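>> >> >> >
>> >> >> > For context, this is just the kernel OOM killer output; the
>> >> >> > kills are easy to spot on the affected node with something
>> >> >> > like:
>> >> >> >
>> >> >> >   dmesg -T | grep -iE 'oom|out of memory'
>> >> >> >   journalctl -k -g oom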
>> >> >> >
>> >> >> > The cluster is stable and operating normally, and there's
>> >> >> > nothing unusual going on before, during or after the kill,
>> >> >> > so it's unclear what causes the mgr to balloon, use all RAM
>> >> >> > and get killed. Systemd logs aren't very helpful: they just
>> >> >> > show normal mgr operations until the mgr fails to allocate
>> >> >> > memory and gets killed: https://pastebin.com/MLyw9iVi
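>> >> >> >
>> >> >> > If it helps, one low-tech way to at least capture the growth
>> >> >> > curve would be to log the mgr RSS periodically on the mgr
>> >> >> > nodes, e.g. something like:
>> >> >> >
>> >> >> >   while sleep 60; do
>> >> >> >       echo "$(date -Is) $(ps -o rss=,vsz= -C ceph-mgr)"
>> >> >> >   done >> /var/tmp/ceph-mgr-rss.log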
>> >> >> >
>> >> >> > The mgr experienced this issue several times in the last 2
>> >> >> > months, and the events don't appear to correlate with any
>> >> >> > other events in the cluster, because basically nothing else
>> >> >> > happened at around those times. How can I investigate this
>> >> >> > and figure out what's causing the mgr to consume all memory
>> >> >> > and get killed?
>> >> >> >
>> >> >> > I would very much appreciate any advice!
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Zakhar
>> >> >> > _______________________________________________
>> >> >> > ceph-users mailing list -- ceph-users@xxxxxxx
>> >> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx