Re: MDS internal op exportdir despite ephemeral pinning

Hi Patrick and everybody,

I wrote a small script that pins the immediate children of 3 sub-dirs on our file system round-robin to our 8 active ranks. I think the experience is worth reporting here. In any case, Patrick, if you can help me get distributed ephemeral pinning to work, that would be great, as the automatic pin updates when changing the size of the MDS cluster would simplify life as an admin a lot.
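
For anyone who wants to try the same, here is a minimal sketch of the kind of script I mean. The directory names below are placeholders, not our real paths, and our actual script differs in details:

#!/bin/bash
# Sketch: round-robin export pins of the immediate children of a few
# high-level sub-dirs across the 8 active ranks. Paths are placeholders.
RANKS=8
rank=0
for parent in /cephfs/home /cephfs/groups /cephfs/scratch; do
    for child in "$parent"/*/; do
        # ceph.dir.pin assigns the sub-tree rooted at "$child" to rank $rank
        setfattr -n ceph.dir.pin -v "$rank" "$child"
        rank=$(( (rank + 1) % RANKS ))
        sleep 1
    done
done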

Before starting the script, the load-balancer had created and distributed about 30K sub-trees over the MDSes. Running the script and setting the pins (with a sleep 1 in between) immediately triggered a re-distribution and consolidation of sub-trees; they were consolidated on the MDSes they were pinned to. There were no health issues during this process, and it took a few minutes to complete.

After that, we ended up with very few sub-trees. Today, the distribution looks like this (ceph tell mds.$mds get subtrees | grep '"path":' | wc -l):

ceph-14: 27
ceph-16: 107
ceph-23: 39
ceph-13: 32
ceph-17: 27
ceph-11: 55
ceph-12: 49
ceph-10: 24
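
The per-daemon counts above were collected with a loop along these lines, using the daemon names from the list:

for mds in ceph-10 ceph-11 ceph-12 ceph-13 ceph-14 ceph-16 ceph-17 ceph-23; do
    printf '%s: ' "$mds"
    # count the authoritative sub-trees reported by each MDS
    ceph tell "mds.$mds" get subtrees | grep -c '"path":'
done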

Rank 1 (ceph-16) has a few more sub-trees pinned to it by hand, but these are not very active.

After the sub-tree consolidation completed, the load on the MDS cluster and the meta-data pools suddenly dropped to very low levels. The MDS daemons' CPU load also went down to 5-10%, compared with the usual 80-140%.

At first I thought things had gone bad, but logging in to a client showed there were no problems. I ran a standard benchmark and saw a 3 to 4 times increase in single-thread IOP/s performance! The MDS cache allocation is also very stable now: the daemons need much less RAM compared with before and they don't thrash much. There are no file-system related slow ops/requests warnings in the logs any more. I used to see exportdir/rejoin/behind-on-trimming a lot; it's all gone.

Conclusion: The built-in dynamic load balancer seems to have been responsible for 90-95% of the FS load - completely artificial internal load that was severely limiting client performance. I think making the internal load balancer much less aggressive would help a lot. A sensible default could be a round-robin pin of low-depth sub-dirs, changing a pin only every few hours based on a number of activity metrics over, say, 7 days, 1 day and 4 hours, aiming for a long-term stable pin distribution.

For example, on our cluster it would be completely sufficient if only the 2-3 busiest high-level sub-tree pins were considered for moving every 24h. Also, considering sub-trees very deep in the hierarchy seems pointless. A balancer sub-tree max-depth setting, limiting how deep in the tree the load balancer looks, would probably improve things. I had a high-level sub-dir distributed over 10K sub-trees, which really didn't help performance at all.

If anyone has the dynamic balancer in action, intentionally or not, it might be worth trying to pin everything up to a depth of 2-3 in the FS tree.
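
As a starting point, something along these lines should do it. The mount point and rank count are placeholders; adjust -maxdepth to taste. Note that a pin on a child overrides the pin inherited from its parent, so pinning several depths at once is fine:

MNT=/cephfs
RANKS=8
rank=0
# Statically pin every directory down to depth 2 below the mount point,
# spread round-robin over the active ranks, so the balancer has little left to move.
find "$MNT" -mindepth 1 -maxdepth 2 -type d | while read -r dir; do
    setfattr -n ceph.dir.pin -v "$rank" "$dir"
    rank=$(( (rank + 1) % RANKS ))
done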

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: 19 November 2022 01:52:02
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  MDS internal op exportdir despite ephemeral pinning

On Fri, Nov 18, 2022 at 2:32 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Patrick,
>
> we plan to upgrade next year. Can't do any faster. However, distributed ephemeral pinning was introduced with octopus. It was one of the major new features and is explained in the octopus documentation in detail.
>
> Are you saying that it is actually not implemented?
> If so, how much of the documentation can I trust?

Generally you can trust the documentation. There are configurations
gating these features, as you're aware. While the documentation didn't
say as much, that indicates they are "previews".

> If it is implemented, I would like to get it working - if this is possible at all. Would you still take a look at the data?

I'm willing to look.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
