Hi Frank,

Sorry for the delay and thanks for sharing the data privately.

On Wed, Nov 23, 2022 at 4:00 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Patrick and everybody,
>
> I wrote a small script that pins the immediate children of 3 sub-dirs
> on our file system in a round-robin way to our 8 active ranks. I think
> the experience is worth reporting here. In any case, Patrick, if you
> can help me get distributed ephemeral pinning to work, that would be
> great, as the automatic pin updates when changing the size of the MDS
> cluster would simplify life as an admin a lot.
>
> Before starting the script, the load balancer had created and
> distributed about 30K sub-trees over the MDSes. Running the script and
> setting the pins (with a sleep 1 in between) immediately triggered a
> re-distribution and consolidation of sub-trees onto the MDSes they
> were pinned to. There were no health issues during this process, and
> it took only a few minutes to complete.
>
> After that, we ended up with very few sub-trees. Today, the
> distribution looks like this (ceph tell mds.$mds get subtrees | grep
> '"path":' | wc -l):
>
> ceph-14: 27
> ceph-16: 107
> ceph-23: 39
> ceph-13: 32
> ceph-17: 27
> ceph-11: 55
> ceph-12: 49
> ceph-10: 24
>
> Rank 1 (ceph-16) has a few more pinned to it by hand, but these are
> not very active.
>
> After the sub-tree consolidation completed, there was suddenly very
> low load on the MDS cluster and the meta-data pools. Also, the MDS
> daemons went down to 5-10% CPU load, compared with the usual 80-140%.
> Great! At first I thought things had gone bad, but logging in to a
> client showed there were no problems. I ran a standard benchmark and
> saw a 3 to 4 times increase in single-thread IOPS! I also see that the
> MDS cache allocation is very stable now; the daemons need much less
> RAM than before and they don't thrash much. No file-system related
> slow ops/requests warnings in the logs any more! I used to see
> exportdir/rejoin/behind-on-trimming a lot; it's all gone.
>
> Conclusion: the built-in dynamic load balancer seems to have been
> responsible for 90-95% of the FS load - completely artificial internal
> load that was greatly limiting client performance. I think making the
> internal load balancer much less aggressive would help a lot. The
> default could be a round-robin pin of low-depth sub-dirs, then
> changing a pin every few hours based on a number of activity metrics
> over, say, 7 days, 1 day and 4 hours, aiming for a long-term stable
> pin distribution.
>
> For example, on our cluster it would be completely sufficient if only
> the 2-3 busiest high-level sub-tree pins were considered for moving
> every 24h. Also, considering sub-trees very deep in the hierarchy
> seems pointless. A balancer sub-tree max-depth setting, limiting the
> depth the load balancer looks at, would probably improve things. I had
> a high-level sub-dir distributed over 10K sub-trees, which really
> didn't help performance at all.
>
> If anyone has the dynamic balancer in action, intentionally or not, it
> might be worth trying to pin everything up to a depth of 2-3 in the FS
> tree.

Hmm, maybe you forgot to turn on the configs?

https://docs.ceph.com/en/octopus/cephfs/multimds/#setting-subtree-partitioning-policies

"Both random and distributed ephemeral pin policies are off by default
in Octopus. The features may be enabled via the
mds_export_ephemeral_random and mds_export_ephemeral_distributed
configuration options."

Otherwise, maybe you found a bug.
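For reference, enabling it looks roughly like this (the path is a
placeholder; substitute one of your three parent sub-dirs):

    # enable the policy on the MDSes (off by default in Octopus)
    ceph config set mds mds_export_ephemeral_distributed true

    # mark the parent directory; its immediate children then get
    # consistently hashed across the active ranks
    setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home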
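If it does turn out to be a bug, a plain round-robin pin script is a
perfectly good workaround. A minimal sketch of what I imagine yours
does (the mount point, parent paths, and rank count are my guesses):

    #!/bin/bash
    # Round-robin static pins for the immediate children of 3 sub-dirs
    # across 8 active ranks.
    ranks=8
    i=0
    for parent in /mnt/cephfs/home /mnt/cephfs/groups /mnt/cephfs/scratch; do
        for dir in "$parent"/*/; do
            setfattr -n ceph.dir.pin -v $((i % ranks)) "$dir"
            i=$((i + 1))
            sleep 1   # pace the pins, as you did
        done
    done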
I would suggest keeping your round-robin script until you can upgrade
to Pacific or Quincy.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx