Re: [Ceph-users] Re: MDS failing under load with large cache sizes

Hey Patrick,

I just wanted to give you some feedback about how 14.2.5 is working for me. I've had the chance to test it for a day now and overall, the experience is much better, although not perfect (perhaps far from it).

I have two active MDS (I figured that would spread the metadata load a little, and it seems to work pretty well for me). After the upgrade to the new release, I removed all special recall settings, so my MDS config is basically at its defaults. The only things I set are mds_max_caps_per_client of 200k, mds_cache_reservation of 0.1, and a mds_cache_memory_limit of 40G (roughly as sketched below).
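For reference, something like this is how those settings can be applied through the config store (the fs name "cephfs" is just a placeholder here, and the equivalent ceph.conf entries under [mds] would of course work too):

    ceph fs set cephfs max_mds 2                             # two active ranks
    ceph config set mds mds_cache_memory_limit 42949672960   # 40 GiB, in bytes
    ceph config set mds mds_cache_reservation 0.1
    ceph config set mds mds_max_caps_per_client 200000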

Right now, everything seems to be running smoothly, although I notice that the max caps setting isn't fully honoured. The overall cache size seems fairly constant at 15M (for mds.0; a little less for mds.1), but the client cap count can easily exceed 10M if I run something like `find` on a large directory. (The commands below are roughly how I keep an eye on those numbers.)
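The daemon name is a placeholder and jq is only there for readability:

    # per-client cap counts held by a given MDS daemon (run on its host)
    ceph daemon mds.<name> session ls | jq '.[] | {id, num_caps}'
    # cache object counts (inodes, dentries, caps) from the perf counters
    ceph daemon mds.<name> perf dump | jq .mds_mem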

We have one particularly problematic folder containing about 400 subfolders holding a total of about 35M files among them. My first attempts at running `find -type d` on those had the weird effect that after pretty much exactly 2M caps, mds.1 got killed and replaced by a standby. Fortunately, the standby managed to take over in a matter of seconds (sometimes up to a few minutes), resetting the cap count to about 5k. The same thing then happened again once the new MDS reached the magical 2M caps. I suppose this is still the same problem as before, but with the huge improvement that the take-over standby MDS can actually recover. Previously, it would just die the same way after a minute or two of futile recovery attempts, and the FS would be down indefinitely until I deleted the openfiles object.

Right now, I cannot reproduce the crash any more---the caps surge to 10-15M, but there is no crash. However, I keep seeing the dreaded "client failing to respond to cache pressure" message occasionally. So far, though, the MDS have been able to keep up and reduce the number of caps again after about 15M, so the message disappears after a while and the cap count growth isn't entirely unbounded. I ran a `find -type d` on the most problematic folder and attached two perf dumps for you (current cap count on the client: 14660568):

https://pastebin.com/W2dVJiW0
https://pastebin.com/pzQ5uQQ3
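(A perf dump like these can be grabbed with something along these lines on the respective MDS host; the daemon name is a placeholder:

    ceph daemon mds.<name> perf dump > perf-dump-$(date +%s).json
)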

Cheers
Janek

P.S. Just as I was finishing this email, the rank 0 MDS actually crashed. Unfortunately, I didn't have increased debug levels enabled, so its death note is rather uninformative:

2019-12-17 09:42:12.325 7f7633dde700  1 mds.deltaweb011 Updating MDS map to version 103112 from mon.3
2019-12-17 09:43:27.774 7f7633dde700  1 mds.deltaweb011 Updating MDS map to version 103113 from mon.3
2019-12-17 09:43:40.086 7f7633dde700  1 mds.deltaweb011 Updating MDS map to version 103114 from mon.3
2019-12-17 09:44:46.203 7f7633dde700 -1 *** Caught signal (Aborted) **
 in thread 7f7633dde700 thread_name:ms_dispatch
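For reference, increased debug levels can be enabled roughly like this ahead of the next occurrence (the exact levels are just my guess at what would be useful):

    # raise MDS debug logging cluster-wide (revert afterwards, the logs grow quickly)
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1
    # or at runtime for a single daemon via its admin socket:
    ceph daemon mds.<name> config set debug_mds 20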

Also, this time around the recovery appears to be a lot more problematic, so I'm afraid I'll have to apply the previous procedure of deleting the openfiles object again to get it back up (roughly sketched below). I don't think my `find` alone would have crashed the MDS, but if another client is doing similar things at the same time, it overloads the MDS.
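For completeness, that procedure is roughly the following; the metadata pool name is a placeholder, the rank 0 object is mds0_openfiles.0 (there may be further mds0_openfiles.N shards), and the affected rank should be down/failed before removing anything:

    rados -p cephfs_metadata ls | grep openfiles
    rados -p cephfs_metadata rm mds0_openfiles.0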
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



