I had similar issues again today. Some users were trying to train a neural network on several million files, resulting in enormous cache sizes. Thanks to my custom cap recall and decay rate settings, the MDSs were able to withstand the load for quite some time, but at some point the active rank crashed, taking the whole CephFS down. As usual, the MDSs were playing round-robin Russian roulette, trying to recover the cache only to be killed by the MONs after a while. I tried increasing the beacon grace time, but it didn't help; the MONs kept kicking out MDSs after what seemed like a random timeout. Even with the setting to wipe the MDS cache on startup, the CephFS was unable to recover. I had to manually delete the mds0_openfiles.* objects from the CephFS metadata pool (nine of them in total; see the sketch at the end of this mail). Only then was I able to get the MDS back into a working state.

I know there are some unreleased patches to improve the MDS behaviour as a result of this thread. Is there any timeline for when those will be available? This issue is rather critical.

What I need is faster cap recall (which I think has been fixed, but not released yet) and probably also some kind of hard limit after which a client has to release file handles. Right now a client can keep requesting new caps while the MDS is failing to recall enough of the old ones. Also, there must be a way to inspect the MDS cache and evict individual clients even when none of the ranks is active (because they are failing to rejoin). Or at least give us a setting for a clean MDS start, similar to what deleting the mds0_openfiles objects achieves.

Cheers
Janek


On 30/08/2019 09:27, Janek Bevendorff wrote:
>
>> The fix has been merged into master and will be backported soon.
>
> Amazing, thanks!
>
>
>> I've also done testing in a large cluster to confirm the issue you
>> found. Using multiple processes to create files as fast as possible
>> in a single client reliably reproduced the issue. The MDS cannot
>> recall capabilities fast enough when the internal upkeep thread ran
>> every 5 seconds. Moving the cache trimming and capability recall to a
>> separate thread running every second resolved the issue.
>>
>> --
>> Patrick Donnelly, Ph.D.
>> He / Him / His
>> Senior Software Engineer
>> Red Hat Sunnyvale, CA
>> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
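
For reference, a minimal sketch of the manual cleanup mentioned above, using the python-rados bindings. The pool name "cephfs_metadata" and the path to ceph.conf are assumptions — adjust them for your cluster — and this should only be run while the affected MDS rank is down; it is not an official recovery procedure, just what worked for me expressed as a script.

    #!/usr/bin/env python3
    # Sketch: remove the per-MDS open file table objects (mds0_openfiles.*)
    # from the CephFS metadata pool. Pool name and conffile path are
    # assumptions; run only while the affected MDS rank is stopped.
    import rados

    POOL = 'cephfs_metadata'  # assumption: your metadata pool may be named differently

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            for obj in ioctx.list_objects():
                if obj.key.startswith('mds0_openfiles.'):
                    print('removing', obj.key)
                    ioctx.remove_object(obj.key)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

The equivalent "rados -p <metadata pool> rm mds0_openfiles.<N>" calls do the same thing; the script just saves typing when there are several of these objects.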