Re: Provide more documentation for MDS performance tuning on large file systems

Thanks, Dan!

I have played with many thresholds, including the decay rates. It is indeed very difficult to assess their effects, since our workloads differ widely depending on what people are working on at the moment. I would need to develop a proper benchmarking suite to simulate the different heavy workloads we have.

> We currently run with all those options scaled up 6x the defaults, and
> we almost never have caps recall warnings these days, with a couple
> thousand cephfs clients.

Under normal operation, we don't either. We had issues in the past with Ganesha and still do sometimes, but that's a bug in Ganesha and we don't really use it for anything but legacy clients anyway. Usually, recall works flawlessly, unless some client suddenly starts doing crazy shit.

We have just a few clients that regularly keep tens of thousands of caps open, and had I not limited the number, it would be hundreds of thousands. Recalling those without threatening stability is not trivial, and at the very least it degrades performance for everybody else. Any pointers on handling this situation better are greatly appreciated. I will definitely try your config recommendations.
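
For anyone reading along in the archive: assuming Dan means the caps-recall and cache-trim knobs, this is roughly the shape of what I will try (assuming a release new enough for ceph config set; otherwise the same options go in ceph.conf). The option names exist, but the values below are purely illustrative, not his actual settings; scale your own release's defaults rather than copying these.

  # Illustrative sketch only: scale the caps-recall / cache-trim thresholds.
  # The numbers are placeholders, not recommendations.
  ceph config set mds mds_recall_max_caps                   30000
  ceph config set mds mds_recall_max_decay_threshold        98304
  ceph config set mds mds_recall_global_max_decay_threshold 393216
  ceph config set mds mds_recall_warning_threshold          196608
  ceph config set mds mds_cache_trim_threshold              393216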

> 2. A user running VSCodium, keeping 15k caps open... the opportunistic
> caps recall eventually starts recalling those but the (el7 kernel)
> client won't release them. Stopping Codium seems to be the only way to
> release.

As I said, 15k is not much for us. The limit right now is 64k caps per client, and a few clients hit it quite regularly. One of them is our VPN gateway, which technically isn't a single client, but looks like one to CephFS due to source NAT. This is certainly something I want to tune further, so that clients are routed directly via their private IPs instead of being NAT'ed. The others are our GPU deep-learning servers (just three of them, but they can generate astounding numbers of IOPS) and the 135-node Hadoop cluster (whose load is hard to sustain for any single machine, so we prefer to use S3 there).
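
In case it helps anyone with a similar NAT setup: the 64k limit I mentioned is simply mds_max_caps_per_client, and the per-session cap counts are easy to eyeball from the session listing. A small sketch (65536 is our value; the MDS rank and the grep pattern are just examples):

  # Cap how many caps a single client session may hold (our 64k limit).
  ceph config set mds mds_max_caps_per_client 65536

  # List sessions on rank 0 and eyeball the per-client cap counts.
  ceph tell mds.0 session ls | grep -E '"id"|"num_caps"'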

> Otherwise, 4GB is normally sufficient in our env for
> mds_cache_memory_limit (3 active MDSs), however this is highly
> workload dependent. If several clients are actively taking 100s of
> thousands of caps, then the 4GB MDS needs to be ultra busy recalling
> caps and latency increases. We saw this live a couple weeks ago: a few
> users started doing intensive rsyncs, and some other users noticed an
> MD latency increase; it was fixed immediately just by increasing the
> mem limit to 8GB.
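
(Side note for the archive: that kind of bump can be done at runtime without restarting the MDSs; the value below is just the 8 GB from Dan's example, in bytes.)

  # Raise the MDS cache memory target on the fly (8 GiB here).
  ceph config set mds mds_cache_memory_limit 8589934592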

So you also run three active MDSs? Are you using directory pinning? We have a very deep and unbalanced directory structure, so I cannot really pin any top-level directory without skewing the load massively. In my experience, three MDSs without explicit pinning aren't much better than one, and are sometimes even worse. But perhaps you have made different observations?
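
For reference, and in case I'm wrong about it not fitting our tree: explicit pinning is just an extended attribute on the directory, and newer releases also offer ephemeral distributed pinning, which I still want to evaluate for a skewed structure like ours. A sketch with made-up paths:

  # Pin a subtree to MDS rank 1 (use -v -1 to remove the pin again).
  setfattr -n ceph.dir.pin -v 1 /cephfs/projects

  # On releases that support it: spread the immediate children of a
  # directory across ranks automatically (ephemeral distributed pinning).
  setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home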


> I agree some sort of tuning best practices should all be documented
> somehow, even though it's complex and rather delicate.

Indeed!


Janek

