Re: Random ephemeral pinning, what happens to sub-tree under pin root dir

On Fri, Dec 13, 2024 at 7:09 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Dear all,
>
> I have a question about random ephemeral pinning that I can't find an answer to in the docs. Question first, background later. I checked the docs for every version from octopus up to the latest; the version we would apply random ephemeral pinning on is pacific. What I would like to configure on a subtree is this:
>
> Enable random ephemeral pinning at the root of the tree, say, /cephfs/root:
>
>    setfattr -n ceph.dir.pin.random -v 0.0001 /cephfs/root
>
> Will this have the following effects:
>
> A) The root of the tree /cephfs/root is ephemerally pinned to a rank according to a consistent hash of its inode number.

No.

> B) Any descendant sub-directory may be ephemerally pinned 0.01 percent of the time to a rank according to a consistent hash of its inode number.

Yes.
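(With -v 0.0001 each newly created descendant directory has a 0.01%
chance of becoming a pinned subtree root, i.e. roughly one pin per
10,000 directories.)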

> The important difference from the docs is point A. I don't want *any* subdir under /cephfs/root to be left *unpinned*. The docs only talk about descendant sub-dirs, but the root matters too: if it is not pinned, it will produce a large number of unpinned dirfrags that float around with expensive exportdir operations, which is exactly what pinning is meant to avoid in the first place.
>
> My questions are:
>
> 1) What does random ephemeral pinning do to the sub-tree root? Is it pinned or not?

Not.
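
As an aside, you can check what is actually pinned by dumping the
subtree map from an active MDS, e.g. (adjust the file system name and
rank to your cluster):

   ceph tell mds.<fs_name>:0 get subtrees | jq '.[] | [.dir.path, .auth_first, .export_pin]'

Anything under /cephfs/root without a pin shows up there as an
ordinary, migratable subtree.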

> 2) If it doesn't pin the root, does this work as intended or will it pin everything to rank 1:
>
>    setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/root
>    setfattr -n ceph.dir.pin -v 1 /cephfs/root

That won't work currently but it probably should. I think as an
intermediate solution you could set the export pin on a parent of
"/cephfs/root".

> Background: We use cephfs as a home file system for an HPC cluster and are exactly in the situation described in the example for distributed ephemeral pinning (https://docs.ceph.com/en/latest/cephfs/multimds/?highlight=ephemeral+pin#setting-subtree-partitioning-policies), *except* that users' home-dirs differ dramatically in size.
>
> We were not pinning at first, and this led to our MDSes going crazy because the load balancer moved dirfrags around all the time. This "load balancing" was itself responsible for 90-95% (!!!) of the total MDS load. After moving to octopus we simulated distributed ephemeral pinning with a cron job that assigned home-dirs in round-robin fashion to the MDS ranks with the fewest pins. This immediately calmed down our entire MDS cluster (8 active and 4 stand-by) and user experience improved dramatically. MDS load dropped from 125-150% (idle load!!) to about 20-25% per MDS, and memory usage stabilized as well.
>
> The easy way forward would be to replace our manual distribution with distributed ephemeral pinning of /home (in octopus this was experimental; after our recent upgrade to pacific we can use the built-in distribution). However, as stated above, home-dir sizes differ to a degree that chunking the file system up into equally-sized sub-trees would be better than distributing entire home-dir trees over ranks: users with very large sub-trees might then get spread out over more than one rank.
>
> This is what random ephemeral pinning seems to be there for, and I would like to chunk our entire filesystem up into sub-trees of 10000-100000 directory fragments each and distribute these over the MDSes. However, this only works if the root, and with it the first sub-tree, is also pinned. Note that this is not a problem with distributed ephemeral pinning, because that policy pins *all* *immediate* children of the pin root and therefore does not create free-floating directory fragments.
>
> I would be grateful if someone could shed light on whether or not the pin root of random ephemeral pinning is itself pinned.

You could do both distributed and random:

setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home
setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/home/*

You'd need to set the random pin whenever a new user directory is
created but that's probably acceptable? The advantage is that you'd
get a default "pretty good" distribution across ranks and then for
really large user directories it would split as you would expect.
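
A sketch of that (re)application step, assuming home dirs live directly
under /cephfs/home (the loop and the xattr check are illustrative; on
some versions reading an unset vxattr may behave differently):

   # apply the random pin to any home dir that does not have it yet
   for d in /cephfs/home/*/; do
       getfattr -n ceph.dir.pin.random --only-values "$d" >/dev/null 2>&1 \
           || setfattr -n ceph.dir.pin.random -v 0.001 "$d"
   done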

Thanks for sharing your use-case.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



