Re: Ceph 16.2.14: ceph-mgr getting oom-killed

Eugen Block <eblock@xxxxxx> · Wed, 22 Nov 2023 09:12:08 +0000

Do you have the full stack trace? The pastebin only contains the
"tcmalloc: large alloc" messages (the same as in the tracker issue). It
may be worth commenting on the tracker issue directly, since Radek asked
for someone with a similar problem in a newer release.

Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

Thanks, Eugen. It is similar in the sense that the mgr is getting
OOM-killed.

It started happening in our cluster after the upgrade to 16.2.14. We
haven't had this issue with earlier Pacific releases.

/Z

On Tue, 21 Nov 2023, 21:53 Eugen Block, <eblock@xxxxxx> wrote:

Just checking it on the phone, but isn’t this quite similar?

https://tracker.ceph.com/issues/45136

Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

> Hi,
>
> I'm facing a rather new issue with our Ceph cluster: from time to time
> ceph-mgr on one of the two mgr nodes gets oom-killed after consuming over
> 100 GB RAM:
>
> [Nov21 15:02] tp_osd_tp invoked oom-killer:
> gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> [  +0.000010]  oom_kill_process.cold+0xb/0x10
> [  +0.000002] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes
> swapents oom_score_adj name
> [  +0.000008] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=504d37b566d9fd442d45904a00584b4f61c93c5d49dc59eb1c948b3d1c096907,mems_allowed=0-1,global_oom,task_memcg=/docker/3826be8f9115479117ddb8b721ca57585b2bdd58a27c7ed7b38e8d83eb795957,task=ceph-mgr,pid=3941610,uid=167
> [  +0.000697] Out of memory: Killed process 3941610 (ceph-mgr)
> total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, shmem-rss:0kB,
> UID:167 pgtables:260356kB oom_score_adj:0
> [  +6.509769] oom_reaper: reaped process 3941610 (ceph-mgr), now
> anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> The cluster is stable and operating normally, there's nothing unusual
> going on before, during or after the kill, thus it's unclear what causes
> the mgr to balloon, use all RAM and get killed. Systemd logs aren't very
> helpful: they just show normal mgr operations until it fails to allocate
> memory and gets killed: https://pastebin.com/MLyw9iVi
>
> The mgr experienced this issue several times in the last 2 months, and
> the events don't appear to correlate with any other events in the cluster
> because basically nothing else happened at around those times. How can I
> investigate this and figure out what's causing the mgr to consume all
> memory and get killed?
>
> I would very much appreciate any advice!
>
> Best regards,
> Zakhar
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
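
As a small aid for reading the kernel log quoted above, here is a minimal sketch (not from the thread) that pulls the process name, PID and memory figures out of an "Out of memory: Killed process" line and converts them to GiB. The names parse_oom_kill, OomKill and kb_to_gib are hypothetical helpers made up for illustration, and the regular expression assumes the line has the same layout as the one quoted here; other kernel versions may format it differently.

```python
import re
from typing import NamedTuple, Optional


class OomKill(NamedTuple):
    """Fields extracted from one kernel OOM-kill summary line (hypothetical helper type)."""
    pid: int
    name: str
    total_vm_kb: int
    anon_rss_kb: int


# Assumption: the line looks like the one quoted in this thread, i.e.
# "Killed process <pid> (<name>) total-vm:<n>kB, anon-rss:<n>kB, ..."
_OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\),?\s+"
    r"total-vm:(?P<total_vm>\d+)kB,\s+anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line: str) -> Optional[OomKill]:
    """Return the parsed fields, or None if the line does not match."""
    m = _OOM_RE.search(line)
    if m is None:
        return None
    return OomKill(
        pid=int(m["pid"]),
        name=m["name"],
        total_vm_kb=int(m["total_vm"]),
        anon_rss_kb=int(m["anon_rss"]),
    )


def kb_to_gib(kb: int) -> float:
    """Convert the kernel's kB figures (units of 1024 bytes) to GiB."""
    return kb / (1024 * 1024)


# The summary line from the quoted log, with the email line wraps joined.
SAMPLE = (
    "Out of memory: Killed process 3941610 (ceph-mgr) "
    "total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, "
    "shmem-rss:0kB, UID:167 pgtables:260356kB oom_score_adj:0"
)

if __name__ == "__main__":
    kill = parse_oom_kill(SAMPLE)
    if kill is not None:
        print(
            f"{kill.name} (pid {kill.pid}): "
            f"anon-rss {kb_to_gib(kill.anon_rss_kb):.1f} GiB, "
            f"total-vm {kb_to_gib(kill.total_vm_kb):.1f} GiB"
        )
```

For the figures quoted in this thread, the anonymous RSS works out to roughly 119.5 GiB, which matches the "over 100 GB" description in the original report.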