Ceph MDS OOM in combination with 6.5.1 kernel client

Stefan Kooman <stefan@xxxxxx> · Tue, 19 Sep 2023 10:57:53 +0200

Hi List,

For those of you who are brave enough to run the 6.5 CephFS kernel client, 
we are seeing some interesting things happening. Some of this might be 
related to this thread [1]. On a couple of shared webhosting platforms 
we are running CephFS with the 6.5.1 kernel. We have set 
"workqueue.cpu_intensive_thresh_us=0" to disable the threshold (to 
prevent CephFS events from being seen as CPU intensive). We have seen 
two MDS OOM situations after that. 
The MDS allocates ~ 60 GiB of RAM above baseline in ~ 50 seconds. In 
both OOM situations, a little before the OOM happens, there is a spike 
of network traffic going out of the MDS to a kernel client (6.5.1). That 
node gets ~ 700 MiB/s of MDS traffic, also for ~ 50 seconds, before the 
MDS process gets killed. Nothing is logged about this. Ceph is 
HEALTH_OK, no logging by kernel client or MDS whatsoever. The MDS 
rejoins and is up and active after a couple of minutes. There is no 
increased load on the MDS or the client that would explain this (as far 
as we can see).
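
To put the approximate figures above side by side, here is a rough 
back-of-the-envelope sketch. It only uses the numbers quoted in this 
mail (~ 60 GiB, ~ 700 MiB/s, ~ 50 seconds); the function names are made 
up for illustration and nothing here queries Ceph.

```python
# Rough arithmetic on the approximate figures quoted in this mail.
# All inputs are estimates; the helper names are hypothetical.
GIB = 1024 ** 3
MIB = 1024 ** 2


def growth_rate_mib_s(growth_gib, seconds):
    """Average memory growth rate in MiB/s."""
    return growth_gib * GIB / MIB / seconds


def bytes_sent_gib(rate_mib_s, seconds):
    """Total data sent in GiB at a sustained rate."""
    return rate_mib_s * MIB * seconds / GIB


alloc_rate = growth_rate_mib_s(60, 50)  # MDS memory growth
sent = bytes_sent_gib(700, 50)          # traffic to the one client

print(f"MDS allocation rate: ~{alloc_rate:.0f} MiB/s")
print(f"Data sent to client: ~{sent:.1f} GiB")
print(f"Allocation / traffic ratio: ~{alloc_rate / 700:.2f}")
```

So the MDS grows by roughly 1.2 GiB/s while sending roughly 34 GiB in 
total to that client, i.e. the memory growth is larger than the 
outgoing traffic. These numbers alone do not tell us how the two are 
related.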

At this point I don't expect anyone to tell me based on these symptoms 
what the issue is. But if you encounter similar issues, please update 
this thread. I'm pretty certain we are hitting a bug (or bugs), as the 
MDS should not blow itself up like that in any case (but rather evict 
the misbehaving client?).

Ceph MDS 16.2.11, MDS_MEMORY_TARGET=160GiB.

Gr. Stefan

[1]: 
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YR5UNKBOKDHPL2PV4J75ZIUNI4HNMC2W/