Hey Patrick,
I just wanted to give you some feedback about how 14.2.5 is working for
me. I've had the chance to test it for a day now and overall, the
experience is much better, although not perfect (perhaps far from it).
I have two active MDS (I figured that would spread the metadata load a
little, and it seems to work pretty well for me). After the upgrade to the
new release, I removed all special recall settings, so my MDS config is
basically at its defaults. The only things I set are a mds_max_caps_per_client
of 200k, a mds_cache_reservation of 0.1, and a mds_cache_memory_limit of 40G.
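For reference, that corresponds to roughly the following, assuming the
values are applied via `ceph config set` (they could just as well sit in
ceph.conf):

    ceph config set mds mds_max_caps_per_client 200000
    ceph config set mds mds_cache_reservation 0.1
    ceph config set mds mds_cache_memory_limit 42949672960  # 40G in bytes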
Right now, everything seems to be running smoothly, although I notice
that the max caps setting isn't fully honoured. The overall cache size
stays fairly constant at around 15M for mds.0 (a little less for mds.1),
but the client cap count can easily exceed 10M if I run something like
`find` on a large directory.
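In case you wonder where I get the per-client cap count from: I'm simply
reading it off the MDS session list, roughly like this (the daemon name
is a placeholder, of course):

    ceph daemon mds.<name> session ls | grep num_caps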
We have one particularly problematic folder containing about 400
subfolders holding a total of about 35M files between them. My first
attempts at running `find -type d` on those had the weird effect that after
pretty much exactly 2M caps, mds.1 got killed and replaced by a standby.
Fortunately, the standby managed to take over in a matter of seconds
(sometimes up to a few minutes), resetting the cap count to about 5k. The
same thing then happened once the new MDS reached the magic 2M caps. I
suppose this is still the same problem as before, but with the huge
improvement that the standby taking over can actually recover.
Previously, it would just die the same way after a minute or two of
futile recovery attempts and the FS would be down indefinitely until I
deleted the openfiles object.
Right now, I cannot reproduce the crash any more; the caps surge to
10-15M, but there is no crash. However, I keep seeing the dreaded "client
failing to respond to cache pressure" message occasionally. So far, the
MDS have been able to keep up and reduce the number of caps again once
they reach about 15M, so the message disappears after a while and the cap
count growth isn't entirely unbounded. I ran a `find -type d` on the
most problematic folder and attached two perf dumps for you (current cap
count on the client: 14660568):
https://pastebin.com/W2dVJiW0
https://pastebin.com/pzQ5uQQ3
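(The dumps are just the output of the MDS admin socket, i.e. roughly
`ceph daemon mds.<name> perf dump` for each of the two daemons.)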
Cheers
Janek
P.S. Just as I was finishing this email, the rank 0 MDS actually
crashed. Unfortunately, I didn't have increased debug levels enabled, so
its death note is rather uninformative:
2019-12-17 09:42:12.325 7f7633dde700 1 mds.deltaweb011 Updating MDS map to version 103112 from mon.3
2019-12-17 09:43:27.774 7f7633dde700 1 mds.deltaweb011 Updating MDS map to version 103113 from mon.3
2019-12-17 09:43:40.086 7f7633dde700 1 mds.deltaweb011 Updating MDS map to version 103114 from mon.3
2019-12-17 09:44:46.203 7f7633dde700 -1 *** Caught signal (Aborted) **
in thread 7f7633dde700 thread_name:ms_dispatch
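For next time, I'll try to bump the debug level in advance so the
backtrace is actually useful, probably something along the lines of:

    ceph config set mds debug_mds 10
    ceph config set mds debug_ms 1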
Also, this time around the recovery appears to be a lot more
problematic, so I'm afraid I will have to apply the previous procedure of
deleting the openfiles object again to get the FS back up (roughly the
commands below). I don't think my `find` alone would have crashed the MDS,
but if another client is doing similar things at the same time, it
overloads the MDS.
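For completeness, the procedure I'm referring to is roughly the following
(with the affected MDS stopped first; the metadata pool name here is
specific to my setup, and there can be more than one openfiles object per
rank):

    rados -p cephfs_metadata rm mds0_openfiles.0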