Re: Provide more documentation for MDS performance tuning on large file systems

Thanks, Dan!

I have played with many thresholds, including the decay rates. It is indeed very difficult to assess their effects, since our workloads differ widely depending on what people are working on at the moment. I would need to develop a proper benchmarking suite to simulate the different heavy workloads we have.

> We currently run with all those options scaled up 6x the defaults, and
> we almost never have caps recall warnings these days, with a couple
> thousand cephfs clients.

Under normal operation, we don't either. We had issues in the past with Ganesha and still do sometimes, but that's a bug in Ganesha and we don't really use it for anything but legacy clients anyway. Usually, recall works flawlessly, unless some client suddenly starts doing crazy shit.

We have just a few clients that regularly keep tens of thousands of caps open, and had I not limited the number, it would be hundreds of thousands. Recalling those without threatening stability is not trivial, and at the very least it degrades performance for everybody else. Any pointers on handling this situation better are greatly appreciated. I will definitely try your config recommendations.
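
For anyone reading along in the archive: assuming Dan means the caps-recall and cache-trim knobs, this is roughly the shape of what I will try (assuming a release new enough for ceph config set; otherwise the same options go in ceph.conf). The option names exist, but the values below are purely illustrative, not his actual settings; scale your own release's defaults rather than copying these.

  # Illustrative sketch only: scale the caps-recall / cache-trim thresholds.
  # The numbers are placeholders, not recommendations.
  ceph config set mds mds_recall_max_caps                   30000
  ceph config set mds mds_recall_max_decay_threshold        98304
  ceph config set mds mds_recall_global_max_decay_threshold 393216
  ceph config set mds mds_recall_warning_threshold          196608
  ceph config set mds mds_cache_trim_threshold              393216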

> 2. A user running VSCodium, keeping 15k caps open... the opportunistic
> caps recall eventually starts recalling those but the (el7 kernel)
> client won't release them. Stopping Codium seems to be the only way to
> release.

As I said, 15k is not much for us. The limit right now is 64k caps per client, and a few clients hit it quite regularly. One of them is our VPN gateway, which technically isn't a single client, but looks like one to CephFS due to source NAT. This is certainly something I want to tune further, so that clients are routed directly via their private IPs instead of being NAT'ed. The others are our GPU deep-learning servers (just three of them, but they can generate astounding numbers of IOPS) and the 135-node Hadoop cluster (whose load is hard to sustain for any single machine, so we prefer to use S3 there).
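
In case it helps anyone with a similar NAT setup: the 64k limit I mentioned is simply mds_max_caps_per_client, and the per-session cap counts are easy to eyeball from the session listing. A small sketch (65536 is our value; the MDS rank and the grep pattern are just examples):

  # Cap how many caps a single client session may hold (our 64k limit).
  ceph config set mds mds_max_caps_per_client 65536

  # List sessions on rank 0 and eyeball the per-client cap counts.
  ceph tell mds.0 session ls | grep -E '"id"|"num_caps"'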

> Otherwise, 4GB is normally sufficient in our env for
> mds_cache_memory_limit (3 active MDSs), however this is highly
> workload dependent. If several clients are actively taking 100s of
> thousands of caps, then the 4GB MDS needs to be ultra busy recalling
> caps and latency increases. We saw this live a couple weeks ago: a few
> users started doing intensive rsyncs, and some other users noticed an
> MD latency increase; it was fixed immediately just by increasing the
> mem limit to 8GB.
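
(Side note for the archive: that kind of bump can be done at runtime without restarting the MDSs; the value below is just the 8 GB from Dan's example, in bytes.)

  # Raise the MDS cache memory target on the fly (8 GiB here).
  ceph config set mds mds_cache_memory_limit 8589934592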

So you also run three active MDSs? Are you using directory pinning? We have a very deep and unbalanced directory structure, so I cannot really pin any top-level directory without skewing the load massively. In my experience, three MDSs without explicit pinning aren't much better than one, and are sometimes even worse. But perhaps you have made different observations?
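
For reference, and in case I'm wrong about it not fitting our tree: explicit pinning is just an extended attribute on the directory, and newer releases also offer ephemeral distributed pinning, which I still want to evaluate for a skewed structure like ours. A sketch with made-up paths:

  # Pin a subtree to MDS rank 1 (use -v -1 to remove the pin again).
  setfattr -n ceph.dir.pin -v 1 /cephfs/projects

  # On releases that support it: spread the immediate children of a
  # directory across ranks automatically (ephemeral distributed pinning).
  setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home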


> I agree some sort of tuning best practices should all be documented
> somehow, even though it's complex and rather delicate.

Indeed!


Janek

