Re: Provide more documentation for MDS performance tuning on large file systems

On Mon, Dec 7, 2020 at 9:42 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Thanks, Dan!
>
> I have played with many thresholds, including the decay rates. It is
> indeed very difficult to assess their effects, since our workloads
> differ widely depending on what people are working on at the moment. I
> would need to develop a proper benchmarking suite to simulate the
> different heavy workloads we have.
>
> > We currently run with all those options scaled up 6x the defaults, and
> > we almost never have caps recall warnings these days, with a couple
> > thousand cephfs clients.
>
> Under normal operation, we don't either. We had issues in the past with
> Ganesha and still do sometimes, but that's a bug in Ganesha and we don't
> really use it for anything but legacy clients anyway. Usually, recall

+1, we've seen that Ganesha issue; it simply won't ever release caps,
even with the latest fixes in this area.

> works flawlessly, unless some client suddenly starts doing crazy shit.
> We have just a few clients who regularly keep tens of thousands of caps
> open and had I not limited the number, it would be hundreds of
> thousands. Recalling them without threatening stability is not trivial,
> and at the very least it degrades performance for everybody else. Any
> pointers on handling this situation better are greatly appreciated.
> I will definitely try your config recommendations.
>
> > 2. A user running VSCodium, keeping 15k caps open.. the opportunistic
> > caps recall eventually starts recalling those but the (el7 kernel)
> > client won't release them. Stopping Codium seems to be the only way to
> > release.
>
> As I said, 15k is not much for us. The limit right now is 64k per
> client, and a few clients hit it quite regularly. One of those clients is

What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 50000, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.
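
For the archives, the knobs in question are the mds_recall_* options.
Scaling them up through the central config looks roughly like this
(illustrative values at ~6x the nautilus defaults -- double-check the
defaults on your release, these aren't necessarily our exact production
numbers):

  # recall thresholds, scaled up from the nautilus defaults
  ceph config set mds mds_recall_max_caps 30000
  ceph config set mds mds_recall_max_decay_threshold 98304
  ceph config set mds mds_recall_global_max_decay_threshold 393216
  ceph config set mds mds_recall_warning_threshold 196608
  # per-client cap ceiling left at its default
  ceph config set mds mds_max_caps_per_client 1048576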

This 15k caps client I mentioned is not related to the max caps per
client config. In recent nautilus, the MDS will proactively recall
caps from idle clients -- so a client with even just a few caps like
this can provoke the caps recall warnings (if it is buggy, like in
this case). The client doesn't cause any real problems, just the
annoying warnings.

So what I'm looking for now is a way to disable proactive recall when a
client's cap count is below some threshold -- `min_caps_per_client` might
do this, but I haven't tested it yet.
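
If anyone wants to poke at the same thing: per-session cap counts are
visible on the MDS admin socket, and the option's full name is
mds_min_caps_per_client. Something like the below (mds name and value
are placeholders, and whether this actually gates the proactive recall
is exactly the part I haven't verified):

  # each session entry reports num_caps for that client
  ceph daemon mds.<name> session ls

  # untested: idle clients below this count might stop being recalled
  ceph config set mds mds_min_caps_per_client 4096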

> our VPN gateway, which, technically, is not a single client, but to the
> CephFS it looks like one due to source NAT. This is certainly something
> I want to tune further, so that clients are routed directly via their
> private IP instead of being NAT'ed. The other ones are our GPU deep
> learning servers (just three of them, but they can generate astounding
> numbers of iops) and the 135-node Hadoop cluster (which is hard to
> sustain for any single machine, so we prefer to use the S3 here).
>
> > Otherwise, 4GB is normally sufficient in our env for
> > mds_cache_memory_limit (3 active MDSs), however this is highly
> > workload-dependent. If several clients are actively taking 100s of
> > thousands of caps, then the 4GB MDS has to work very hard recalling
> > caps and latency increases. We saw this live a couple of weeks ago: a few
> > users started doing intensive rsyncs, and some other users noticed an
> > MD latency increase; it was fixed immediately just by increasing the
> > mem limit to 8GB.
>
> So you too have 3 active MDSs? Are you using directory pinning? We have
> a very deep and unbalanced directory structure, so I cannot really pin
> any top-level directory without skewing the load massively. From my
> experience, three MDSs without explicit pinning aren't much better than
> one, and are sometimes even worse. But perhaps you have different observations?

Yes, 3 active today, and lots of pinning thanks to our flat hierarchy.
User dirs are pinned to one of the three ranks at random, as are the
Manila shares (sketch below).
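
(For anyone who hasn't set this up before: the pin is just an extended
attribute on the directory, set from any client mount -- the paths below
are made-up examples.)

  # pin this user's tree to MDS rank 1
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/users/alice
  # check it
  getfattr -n ceph.dir.pin /mnt/cephfs/users/alice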
Turning the MD balancer on creates a disaster in our env -- too much
ping-pong of dirs between the MDSs, too much metadata IO needed to keep
up, not to mention the "nice export" bugs in the past that forced us to
disable the balancer in the first place.

We used to have 10 active MDSs, but that is such a pain during
upgrades that we're now trying with just three. Next upgrade we'll
probably leave it at one for a while to see if that suffices.
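
(Growing or shrinking the active set is just max_mds -- "cephfs" below is
a placeholder for your fs name; on nautilus, reducing max_mds stops the
extra ranks for you.)

  ceph fs set cephfs max_mds 1    # try a single active MDS
  ceph fs set cephfs max_mds 3    # back to three later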

Multi-active + pinning definitely increases the overall MD throughput
(once you can get the relevant inodes cached), because as you know the
MDS is single-threaded and CPU-bound at the limit.
We could get something like 4-5k handle_client_requests out of a
single MDS, and that really does scale horizontally as you add MDSs
(and pin).
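
(If you want to watch the per-rank request rate yourself, ceph fs status
reports it, e.g.:)

  # per-rank Reqs/s, refreshed every couple of seconds
  watch -n 2 ceph fs status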

Cheers, Dan

>
>
> > I agree some sort of tuning best practises should all be documented
> > somehow, even though it's complex and rather delicate.
>
> Indeed!
>
>
> Janek
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


