For those of you who are brave enough to run the 6.5 CephFS kernel client:
we are seeing some interesting things happening. Some of this might be
related to this thread. On a couple of shared webhosting platforms
we are running CephFS with the 6.5.1 kernel. We have set
"workqueue.cpu_intensive_thresh_us=0", which disables the detection that
flags CephFS work items as CPU intensive. We have seen two MDS OOM
situations since then.
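For reference, a minimal sketch of how that parameter can be applied; the GRUB paths are assumptions and may differ per distro, and runtime writability of the sysfs parameter may vary by kernel build:

```shell
# Boot-time: append to the kernel command line in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX="... workqueue.cpu_intensive_thresh_us=0"
# then regenerate the GRUB config and reboot:
grub-mkconfig -o /boot/grub/grub.cfg

# Runtime (if the module parameter is writable on your kernel;
# affects newly scheduled work items):
echo 0 > /sys/module/workqueue/parameters/cpu_intensive_thresh_us

# Verify the current value:
cat /sys/module/workqueue/parameters/cpu_intensive_thresh_us
```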
The MDS allocates ~ 60 GiB of RAM above baseline in ~ 50 seconds. In
both OOM situations, a little before the OOM happens, there is a spike
of network traffic going out of the MDS to a kernel client (6.5.1). That
node gets ~ 700 MiB/s of MDS traffic, also for ~ 50 seconds, before the
MDS process gets killed. Nothing is logged about this. Ceph is
HEALTH_OK, no logging by kernel client or MDS whatsoever. The MDS
rejoins and is up and active after a couple of minutes. There is no
increased load on the MDS or the client that explains this (as far as
we can see).
At this point I don't expect anyone to tell me based on these symptoms
what the issue is. But if you encounter similar issues, please update
this thread. I'm pretty certain we are hitting a bug (or bugs), as the
MDS should not blow itself up like that in any case, but instead evict
the (misbehaving?) client.
Ceph MDS 16.2.11, MDS_MEMORY_TARGET=160GiB.
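For anyone comparing setups: a sketch of the commands we'd use to inspect the cache target and, if needed, evict a client manually. This assumes the memory target above corresponds to the standard `mds_cache_memory_target` option; the MDS name and client id are placeholders:

```shell
# Show the configured MDS cache memory target
ceph config get mds mds_cache_memory_target

# Inspect cache usage on a given MDS daemon (run on the MDS host)
ceph daemon mds.<name> cache status

# List connected clients and evict a specific one by id
ceph tell mds.<name> client ls
ceph tell mds.<name> client evict id=<client-id>
```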