I had similar issues again today. Some users were trying to train a neural network on several million files, resulting in enormous cache sizes. Thanks to my custom cap recall and decay rate settings, the MDSs were able to withstand the load for quite some time, but at some point the active rank crashed, taking the whole CephFS down. As usual, the MDSs were playing round-robin Russian roulette, trying to recover the cache only to be killed by the MONs after a while. I tried increasing the beacon grace time, but it didn't help; the MONs kept kicking out MDSs after what seemed like a random timeout. Even with the setting to wipe the MDS cache on startup, the CephFS was unable to recover. I had to manually delete the mds0_openfiles.* objects from the CephFS metadata pool (nine of them in total; see the sketch at the end of this mail). Only then was I able to get the MDS back into a working state.

I know there are some unreleased patches to improve the MDS behaviour as a result of this thread. Is there any timeline for when those will be available? This issue is rather critical.

What I need is faster cap recall (which I think has been fixed, but not released yet) and probably also some kind of hard limit after which a client has to release file handles. Right now a client can keep requesting new caps while the MDS is failing to recall enough of the old ones. Also, there must be a way to inspect the MDS cache and evict individual clients even when none of the ranks is active (because they are failing to rejoin). Or at least give us a setting for a clean MDS start, similar to what deleting the mds0_openfiles objects achieves.

Cheers
Janek


On 30/08/2019 09:27, Janek Bevendorff wrote:
>
>> The fix has been merged into master and will be backported soon.
>
> Amazing, thanks!
>
>
>> I've also done testing in a large cluster to confirm the issue you
>> found. Using multiple processes to create files as fast as possible
>> in a single client reliably reproduced the issue. The MDS cannot
>> recall capabilities fast enough when the internal upkeep thread ran
>> every 5 seconds. Moving the cache trimming and capability recall to a
>> separate thread running every second resolved the issue.
>>
>> --
>> Patrick Donnelly, Ph.D.
>> He / Him / His
>> Senior Software Engineer
>> Red Hat Sunnyvale, CA
>> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
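
For reference, a minimal sketch of the manual cleanup mentioned above, using the python-rados bindings. The pool name "cephfs_metadata" and the path to ceph.conf are assumptions — adjust them for your cluster — and this should only be run while the affected MDS rank is down; it is not an official recovery procedure, just what worked for me expressed as a script.

    #!/usr/bin/env python3
    # Sketch: remove the per-MDS open file table objects (mds0_openfiles.*)
    # from the CephFS metadata pool. Pool name and conffile path are
    # assumptions; run only while the affected MDS rank is stopped.
    import rados

    POOL = 'cephfs_metadata'  # assumption: your metadata pool may be named differently

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            for obj in ioctx.list_objects():
                if obj.key.startswith('mds0_openfiles.'):
                    print('removing', obj.key)
                    ioctx.remove_object(obj.key)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

The equivalent "rados -p <metadata pool> rm mds0_openfiles.<N>" calls do the same thing; the script just saves typing when there are several of these objects.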