Re: Random ephemeral pinning, what happens to sub-tree under pin root dir


 



Hi Patrick,

thanks for your answers. We can't pin the directory above /cephfs/root because it is the root of the CephFS file system itself, which doesn't accept any pinning. Following your explanation and the docs, I'm also not sure what the original/intended use case for random pinning was/is. To me it makes no sense to have some parts pinned and a potentially very large part at the pin root unpinned. With a pinning probability of 0.01 and a depth-first walk, we are talking about an initial sub-tree of depth on the order of 100 for any descendant; take a full binary tree as a file system, and that is a potentially huge unpinned sub-tree hanging at the pin root.
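
To make the size of that unpinned sub-tree concrete, here is a rough Python sketch under a deliberately simplified model. It assumes that every directory below the pin root is pinned independently with probability p and that a pinned directory takes its whole sub-tree away from the root; the real MDS behaviour depends on when inodes are loaded and may differ. The function names are made up for illustration.

```python
import random


def unpinned_root_subtree_size(depth, p, rng):
    """Count directories that stay attached to the (unpinned) pin root.

    Toy model (an assumption, not the MDS algorithm): a full binary tree
    of the given depth, where each non-root directory is independently
    pinned with probability p. A pinned directory and everything below it
    leaves the root's sub-tree.
    """
    size = 1       # the pin root itself is never pinned
    frontier = 1   # unpinned directories at the current level
    for _ in range(depth):
        children = 2 * frontier
        # keep only the children that were NOT picked for pinning
        frontier = sum(1 for _ in range(children) if rng.random() >= p)
        size += frontier
        if frontier == 0:
            break
    return size


def expected_unpinned_size(depth, p):
    """Expected value of the above: sum over levels of (2 * (1 - p)) ** d."""
    return sum((2 * (1 - p)) ** d for d in range(depth + 1))


if __name__ == "__main__":
    rng = random.Random(42)
    depth, p, trials = 10, 0.01, 100
    mean = sum(unpinned_root_subtree_size(depth, p, rng) for _ in range(trials)) / trials
    print("simulated mean size:", mean)
    print("expected size:      ", expected_unpinned_size(depth, p))
    print("full tree size:     ", 2 ** (depth + 1) - 1)
```

In this model the unpinned part grows by a factor of 2 * (1 - p) per level, so for small p it is almost the entire tree until the tree gets much deeper than 1/p.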

Setting ephemeral pinning according to something like

  setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home
  setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/home/*

will work for us. Are there any stats on the size/depth of sub-trees pinned to the same rank under random ephemeral pinning with access patterns like "depth-first walk", "breadth-first walk", and "random-leaf walk"? Similarly, are there actual stats for long-term equilibrium sizes of sub-trees pinned to the same rank under constant random access loads? Basically, any information that would help in choosing a reasonable probability value for our home-dir sizes would be welcome. The practical result of random pinning is rather unintuitive, and it would be great to have some examples with stats.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: Wednesday, December 18, 2024 4:52 AM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  Random ephemeral pinning, what happens to sub-tree under pin root dir

On Fri, Dec 13, 2024 at 7:09 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Dear all,
>
> I have a question about random ephemeral pinning that I can't find an answer to in the docs. Question first and some background later. I checked the docs for every version from octopus up to latest. The version we will apply random ephemeral pinning on is pacific. What I would like to configure on a subtree is this:
>
> Enable random ephemeral pinning at the root of the tree, say, /cephfs/root:
>
>    setfattr -n ceph.dir.pin.random -v 0.0001 /cephfs/root
>
> Will this have the following effects:
>
> A) The root of the tree /cephfs/root is ephemerally pinned to a rank according to a consistent hash of its inode number.

No.

> B) Any descendant sub-directory may be ephemerally pinned 0.01 percent of the time to a rank according to a consistent hash of its inode number.

Yes.

> The important difference to the docs is point A. I don't want to have *any* subdir under the root /cephfs/root *not pinned* to an MDS. The docs only talk about descendant sub-dirs, but the root is important here too: if it is not pinned, it will create a large number of unpinned dirfrags that float around with expensive exportdir operations, which pinning is there to avoid in the first place.
>
> My questions are:
>
> 1) What does random ephemeral pinning do to the sub-tree root? Is it pinned or not?

Not.

> 2) If it doesn't pin the root, does this work as intended or will it pin everything to rank 1:
>
>    setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/root
>    setfattr -n ceph.dir.pin -v 1 /cephfs/root

That won't work currently but it probably should. I think as an
intermediate solution you could set the export pin on a parent of
"/cephfs/root".

> Background: We use cephfs as a home file system for an HPC cluster and are in exactly the situation of the example for distributed ephemeral pinning (https://docs.ceph.com/en/latest/cephfs/multimds/?highlight=ephemeral+pin#setting-subtree-partitioning-policies) *except* that the home-dirs of users differ dramatically in size.
>
> We were not pinning at first and this led to our MDSes going crazy due to the load balancer moving dirfrags around all the time. This "load balancing" was itself responsible for 90-95% (!!!) of the total MDS load. After moving to octopus we simulated distributed ephemeral pinning with a cron job that assigned home-dirs in a round-robin fashion to the MDS ranks that had the fewest pins. This immediately calmed down our entire MDS cluster (8 active and 4 stand-by) and user experience improved dramatically. MDS load dropped from 125-150% (idle load!!) to about 20-25% per MDS and memory usage stabilized as well.
>
> The easy way forward would be to replace our manual distribution with distributed ephemeral pinning of /home (in octopus this was experimental, after our recent upgrade to pacific we can use the built-in distribution). However, as stated above, the size of home-dirs differs to a degree that chunking up the file system into equally-sized sub-dir trees would be better than distributing entire home dir trees over ranks. Users with very large sub-trees might get spread out over more than one rank.
>
> This is what random ephemeral pinning seems to be there for and I would like to chunk our entire filesystem up into sub-trees of size 10000-100000 directory fragments and distribute these over the MDSes. However, this only works if the root and with it the first sub-tree is also pinned. Note that this is not a problem with distributed ephemeral pinning, because this policy pins *all* *immediate* children of the pin root and, therefore, does not create free-floating directory fragments.
>
> I would be grateful if someone could shed light on whether the pin root of random ephemeral pinning is itself pinned.

You could do both distributed and random:

setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home
setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/home/*

You'd need to set the random pin whenever a new user directory is
created but that's probably acceptable? The advantage is that you'd
get a default "pretty good" distribution across ranks and then for
really large user directories it would split as you would expect.
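
A periodic job could take care of newly created user directories. The following is only a minimal sketch, assuming a Linux client with the file system mounted and Python 3 available; the function name pin_home_dirs and the injectable setxattr parameter are hypothetical helpers for illustration. It does the same thing as running the setfattr command above on every immediate sub-directory.

```python
import os


def pin_home_dirs(home_root, probability="0.001", setxattr=None):
    """Set ceph.dir.pin.random on every immediate sub-directory of home_root.

    probability is passed as a string so it is written exactly as given.
    setxattr defaults to os.setxattr (Linux only); it can be replaced,
    for example for a dry run.
    """
    if setxattr is None:
        setxattr = os.setxattr
    value = probability.encode()
    touched = []
    with os.scandir(home_root) as entries:
        for entry in entries:
            # only real directories, skip files and symlinks
            if entry.is_dir(follow_symlinks=False):
                setxattr(entry.path, "ceph.dir.pin.random", value)
                touched.append(entry.path)
    return sorted(touched)


if __name__ == "__main__":
    # Dry run: print what would be set instead of touching the file system.
    # "/cephfs/home" is the path used in the commands above.
    if os.path.isdir("/cephfs/home"):
        pin_home_dirs("/cephfs/home", "0.001",
                      setxattr=lambda path, name, val: print(path, name, val))
```

Re-applying the attribute to directories that already have it should be harmless, so the job does not need to track which directories are new.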

Thanks for sharing your use-case.

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
