Ceph MDS OOM in combination with 6.5.1 kernel client



Hi List,

For those of you brave enough to run the 6.5 CephFS kernel client: we are seeing some interesting things happen. Some of this might be related to this thread [1]. On a couple of shared webhosting platforms we run CephFS with a 6.5.1 kernel. We have disabled the CPU-intensive workqueue detection (workqueue.cpu_intensive_thresh_us=0) to prevent CephFS work items from being flagged as CPU intensive.

Since then we have seen two MDS OOM situations. The MDS allocates ~60 GiB of RAM above its baseline in ~50 seconds. In both cases, shortly before the OOM, there is a spike of network traffic from the MDS to a single kernel client (6.5.1): that node receives ~700 MiB/s of MDS traffic, also for ~50 seconds, until the MDS process gets killed. Nothing is logged about this: Ceph is HEALTH_OK, and neither the kernel client nor the MDS logs anything. The MDS rejoins and is up and active again after a couple of minutes. As far as we can see, there is no increased load on the MDS or the client that would explain this.
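For what it's worth, a back-of-the-envelope check of the numbers above (these are the approximate figures from this post, not measurements) shows the memory growth outpacing the network egress by roughly 1.75x, so the traffic alone does not account for all of the allocated RAM:

```python
# Rough sanity check of the figures reported above.
# All inputs are the approximate numbers from this post.
GIB = 1024 ** 3
MIB = 1024 ** 2

duration_s = 50                  # both the RAM spike and the egress spike
mds_growth_bytes = 60 * GIB      # MDS RAM allocated above baseline
egress_rate = 700 * MIB          # MDS -> client traffic, bytes/s

egress_total = egress_rate * duration_s           # ~34 GiB sent to the client
alloc_rate = mds_growth_bytes / duration_s        # ~1.2 GiB/s allocated

print(f"egress over spike : {egress_total / GIB:.1f} GiB")
print(f"allocation rate   : {alloc_rate / GIB:.2f} GiB/s")
print(f"growth vs egress  : {mds_growth_bytes / egress_total:.2f}x")
```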

At this point I don't expect anyone to tell me what the issue is based on these symptoms alone. But if you encounter similar issues, please update this thread. I'm pretty certain we are hitting a bug (or bugs): the MDS should not blow itself up like that in any case, but rather evict the (misbehaving?) client.

Ceph MDS 16.2.11, MDS_MEMORY_TARGET=160GiB.

Gr. Stefan

[1]: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YR5UNKBOKDHPL2PV4J75ZIUNI4HNMC2W/
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
