Hi Janek,

My understanding is that the recall thresholds (see my list below) should be scaled proportionally. OTOH, I haven't played with the decay rates (and don't know if there's any significant value in tuning those).

We have a recall tuning script that we use to deploy different scale factors whenever there are caps recall issues:

    X=$1
    echo "Scaling MDS Recall by ${X}x"
    ceph tell mds.* injectargs -- \
        --mds_recall_max_decay_threshold $((X*16*1024)) \
        --mds_recall_max_caps $((X*5000)) \
        --mds_recall_global_max_decay_threshold $((X*64*1024)) \
        --mds_recall_warning_threshold $((X*32*1024)) \
        --mds_cache_trim_threshold $((X*64*1024))

We currently run with all of those options scaled up to 6x the defaults, and we almost never see caps recall warnings these days, with a couple thousand CephFS clients.

In the past month I've seen two different cases of a client not releasing caps even with these options:

1. A user had ceph-fuse mounted /cephfs/ on top of a second ceph-fuse /cephfs. The outer (i.e. lower) mountpoint/process held several thousand caps that could never be released until the user cleaned up their mounts.

2. A user running VSCodium, keeping 15k caps open. The opportunistic caps recall eventually starts recalling those, but the (el7 kernel) client won't release them. Stopping Codium seems to be the only way to release them.

Otherwise, 4 GB is normally sufficient in our environment for mds_cache_memory_limit (3 active MDSs); however, this is highly workload dependent. If several clients are actively taking hundreds of thousands of caps, then a 4 GB MDS has to be extremely busy recalling caps, and latency increases. We saw this live a couple of weeks ago: a few users started doing intensive rsyncs, and some other users noticed a metadata latency increase; it was fixed immediately just by raising the memory limit to 8 GB.

I agree that some sort of tuning best practices should be documented somehow, even though the topic is complex and rather delicate.
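If anyone wants to sanity-check the numbers before injecting them, the arithmetic in the script can be reproduced standalone. This is just a sketch that prints the values a given scale factor produces without touching the cluster (it assumes the thresholds are scaled from the stock 16K/5000/64K/32K/64K figures shown in the script):

```shell
#!/bin/sh
# Print the values the recall tuning script would inject for a given
# scale factor, without running ceph. X=6 mirrors the factor we use.
X=${1:-6}
echo "mds_recall_max_decay_threshold        $((X*16*1024))"
echo "mds_recall_max_caps                   $((X*5000))"
echo "mds_recall_global_max_decay_threshold $((X*64*1024))"
echo "mds_recall_warning_threshold          $((X*32*1024))"
echo "mds_cache_trim_threshold              $((X*64*1024))"
```

For X=6 that works out to 98304, 30000, 393216, 196608 and 393216 respectively.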
-- Dan

On Sat, Jan 25, 2020 at 5:54 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Hello,
>
> Over the last week I have tried optimising the performance of our MDS
> nodes for the large number of files and concurrent clients we have. It
> turns out that despite various stability fixes in recent releases, the
> default configuration still doesn't appear to be optimal for keeping
> the cache size under control and avoiding intermittent I/O blocks.
>
> Unfortunately, it is very hard to tweak the configuration to something
> that works, because the tuning parameters needed are largely
> undocumented or only described in very technical terms in the source
> code, making them quite unapproachable for administrators not familiar
> with all the CephFS internals. I would therefore like to ask whether it
> would be possible to document the "advanced" MDS settings more clearly:
> what they do, and in which direction they have to be tuned for more or
> less aggressive cap recall, for instance (sometimes it is not clear
> whether a threshold is a min or a max threshold).
>
> I am in the very (un)fortunate situation of having folders with several
> hundred thousand direct subfolders or files (and one extreme case with
> almost 7 million dentries), which makes a pretty good benchmark for
> measuring cap growth while performing operations on them. For the time
> being, I came up with this configuration, which seems to work for me,
> but is still far from optimal:
>
> mds basic    mds_cache_memory_limit    10737418240
> mds advanced mds_cache_trim_threshold  131072
> mds advanced mds_max_caps_per_client   500000
> mds advanced mds_recall_max_caps       17408
> mds advanced mds_recall_max_decay_rate 2.000000
>
> The parameters I am least sure about---because I understand the least
> how they actually work---are mds_cache_trim_threshold and
> mds_recall_max_decay_rate.
> Despite reading the description in
> src/common/options.cc, I understand only half of what they're doing,
> and I am also not quite sure in which direction to tune them for
> optimal results.
>
> Another point where I am struggling is the correct configuration of
> mds_recall_max_caps. The default of 5K doesn't work too well for me,
> but values above 20K also don't seem to be a good choice. While high
> values result in fewer blocked ops and better performance without
> destabilising the MDS, they also lead to slow but unbounded cache
> growth, which seems counter-intuitive. 17K was the maximum I could go.
> Higher values work for most use cases, but when listing very large
> folders with millions of dentries, the MDS cache size slowly starts to
> exceed the limit after a few hours, since the MDSs fail to keep clients
> below mds_max_caps_per_client despite not showing any "failing to
> respond to cache pressure" warnings.
>
> With the configuration above, I no longer have cache size issues, but
> this comes at the cost of performance and slow/blocked ops. A few hints
> as to how I could optimise my settings for better client performance
> would be much appreciated, and so would additional documentation for
> all the "advanced" MDS settings.
>
> Thanks a lot
> Janek
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
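A footnote on chasing down which clients are actually holding the caps when the cache exceeds its limit: one approach is to rank sessions by their cap count from the MDS session list. This is only a sketch; it assumes a Nautilus-era `ceph tell mds.<rank> session ls` whose JSON output includes a num_caps field per session, and that jq is available on the admin host:

```shell
#!/bin/sh
# Sketch: show the ten sessions holding the most caps on mds.0, to see
# which clients are pushing toward mds_max_caps_per_client.
# Assumes `session ls` emits a JSON array with .num_caps and
# .entity.name.{type,num} per session (check your release's output).
ceph tell mds.0 session ls \
  | jq -r '.[] | "\(.num_caps) \(.entity.name.type).\(.entity.name.num)"' \
  | sort -rn | head -10
```

That would have made the two stuck clients above (the nested ceph-fuse mount and the VSCodium user) immediately visible at the top of the list.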