Re: Ceph 16.2.14: ceph-mgr getting oom-killed

I see these progress messages all the time; I don't think they cause it, but I might be wrong. You can disable the progress module just to rule that out.
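
A minimal sketch of how to toggle it, assuming the 'ceph progress' on/off commands behave in 16.2.14 as they do elsewhere in Pacific:

  # turn the progress module off for the duration of the test
  ceph progress off

  # re-enable it once you have ruled it out
  ceph progress on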

Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

Unfortunately, I don't have a full stack trace because there's no crash
when the mgr gets oom-killed. There's just the mgr log, which looks
completely normal until about 2-3 minutes before the oom-kill, when
tcmalloc warnings show up.

I'm not sure it's the same issue as the one described in the tracker. We do
seem to have some stale "events" in the progress module, though:

Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000
7f4bb19ef700  0 [progress WARNING root] complete: ev
cacc4230-75ee-4892-b8fd-a19fec8f9f66 does not exist
Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000
7f4bb19ef700  0 [progress WARNING root] complete: ev
44824331-3f6b-45c4-b925-423d098c3c76 does not exist
Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000
7f4bb19ef700  0 [progress WARNING root] complete: ev
0139bc54-ae42-4483-b278-851d77f23f9f does not exist
Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000
7f4bb19ef700  0 [progress WARNING root] complete: ev
f9d6c20e-b8d8-4625-b9cf-84da1244c822 does not exist
Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000
7f4bb19ef700  0 [progress WARNING root] complete: ev
1486b26d-2a23-4416-a864-2cbb0ecf1429 does not exist
Nov 21 14:56:30 ceph01 bash[3941523]: debug 2023-11-21T14:56:30.718+0000
7f4bb19ef700  0 [progress WARNING root] complete: ev
7f14d01c-498c-413f-b2ef-05521050190a does not exist
Nov 21 14:57:35 ceph01 bash[3941523]: debug 2023-11-21T14:57:35.950+0000
7f4bb19ef700  0 [progress WARNING root] complete: ev
48cbd97f-82f7-4b80-8086-890fff6e0824 does not exist

I tried clearing them, but they keep showing up. I am wondering whether these
stale events can cause memory leaks over time.
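
For reference, the module's events can be inspected and cleared from the CLI; a minimal sketch, assuming 'ceph progress clear' is available in this release:

  # human-readable summary of events the progress module is tracking
  ceph progress

  # raw dump, including the event UUIDs seen in the log above
  ceph progress json

  # drop the events the module is currently tracking
  ceph progress clear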

/Z

On Wed, 22 Nov 2023 at 11:12, Eugen Block <eblock@xxxxxx> wrote:

Do you have the full stack trace? The pastebin only contains the
"tcmalloc: large alloc" messages (same as in the tracker issue). Maybe
comment on the tracker issue directly, since Radek asked for someone
with a similar problem in a newer release.

Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:

> Thanks, Eugen. It is similar in the sense that the mgr is getting
> OOM-killed.
>
> It started happening in our cluster after the upgrade to 16.2.14. We
> haven't had this issue with earlier Pacific releases.
>
> /Z
>
> On Tue, 21 Nov 2023, 21:53 Eugen Block, <eblock@xxxxxx> wrote:
>
>> Just checking it on the phone, but isn’t this quite similar?
>>
>> https://tracker.ceph.com/issues/45136
>>
>> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>
>> > Hi,
>> >
>> > I'm facing a rather new issue with our Ceph cluster: from time to time
>> > ceph-mgr on one of the two mgr nodes gets oom-killed after consuming over
>> > 100 GB RAM:
>> >
>> > [Nov21 15:02] tp_osd_tp invoked oom-killer:
>> > gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
>> > [  +0.000010]  oom_kill_process.cold+0xb/0x10
>> > [  +0.000002] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes
>> > swapents oom_score_adj name
>> > [  +0.000008] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=504d37b566d9fd442d45904a00584b4f61c93c5d49dc59eb1c948b3d1c096907,mems_allowed=0-1,global_oom,task_memcg=/docker/3826be8f9115479117ddb8b721ca57585b2bdd58a27c7ed7b38e8d83eb795957,task=ceph-mgr,pid=3941610,uid=167
>> > [  +0.000697] Out of memory: Killed process 3941610 (ceph-mgr) total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:260356kB oom_score_adj:0
>> > [  +6.509769] oom_reaper: reaped process 3941610 (ceph-mgr), now
>> > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>> >
>> > The cluster is stable and operating normally; there's nothing unusual
>> > going on before, during or after the kill, so it's unclear what causes
>> > the mgr to balloon, use all RAM and get killed. Systemd logs aren't
>> > very helpful: they just show normal mgr operations until it fails to
>> > allocate memory and gets killed: https://pastebin.com/MLyw9iVi
>> >
>> > The mgr has experienced this issue several times in the last 2 months,
>> > and the kills don't appear to correlate with any other events in the
>> > cluster, because basically nothing else happened at around those times.
>> > How can I investigate this and figure out what's causing the mgr to
>> > consume all memory and get killed?
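
One minimal way to start narrowing this down (a sketch, not a definitive procedure; the log path, interval and module choices below are arbitrary):

  # record the mgr's RSS every 5 minutes so the growth can later be lined
  # up with cluster activity; run this on the active mgr host, e.g. in a
  # screen/tmux session
  while true; do
      echo "$(date -u +%FT%TZ) $(ps -C ceph-mgr -o rss=,args=)" >> /var/log/ceph-mgr-rss.log
      sleep 300
  done

  # review which mgr modules are enabled; disabling non-essential ones one
  # at a time can help tie a leak to a single module (always-on modules
  # cannot be disabled)
  ceph mgr module ls
  ceph mgr module disable <module>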
>> >
>> > I would very much appreciate any advice!
>> >
>> > Best regards,
>> > Zakhar
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx