Re: MDS stuck ops

Hi Venky and Patrick,

Thanks, Venky, for your explanation. Now I understand, hopefully. The difference is:

- octopus: every immediate child of /home becomes its own subtree that is pinned individually
- pacific: /home itself is split into directory fragments, and each fragment (covering a subset of the immediate children) gets an individual pin

So, say I have a /home with 550 dirs: octopus would create 550 subtrees and pin these against ranks. Pacific would look at /home, find x fragments and pin each fragment against a rank, meaning that all sub-dirs in one fragment are pinned against the same rank.

This seems to make quite some sense, as with 550 dirs and 8 MDSes I would have to pin about 68-69 dirs to the same rank anyway. It seems more sensible to pin a fragment with 68-69 dirs in one go. Unless, of course, you have all the active users in one fragment.
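To put some numbers on that spread (an illustrative hash-pinning sketch; the real MDS hashes the child's inode number with its own function, but any stable hash shows the idea):

```python
import hashlib
from collections import Counter

def rank_for_inode(ino: int, num_ranks: int = 8) -> int:
    """Pin a directory by a stable hash of its inode number (illustrative,
    not Ceph's actual hash)."""
    digest = hashlib.sha256(ino.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_ranks

# 550 home directories spread over 8 ranks:
counts = Counter(rank_for_inode(ino) for ino in range(550))
print(sum(counts.values()) / 8)  # 68.75 dirs per rank on average
```

With a uniform hash the per-rank counts cluster around 68-69, matching the manual-pinning arithmetic above.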

This is where I wonder how the fragmentation works. There is the parameter mds_bal_split_size (minimum size of a directory fragment before splitting) with default 10000, which is much larger than 550. My naive expectation would be that /home is exactly one fragment and, therefore, everything lands on one MDS only. So there must be something extra at work here that subdivides such a small directory; I would expect at least as many fragments to be created as there are ranks.
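As a rough illustration of that expectation (a hypothetical sketch, not the actual MDS code; the parameter name is from the docs, the "always split under a distributed pin" rule is a simplified model of the aggressive pre-splitting behaviour):

```python
MDS_BAL_SPLIT_SIZE = 10000   # documented default: dirents before a normal split

def should_split(num_dentries: int, distributed_pin: bool) -> bool:
    """Simplified model: a distributed ephemeral pin forces pre-splitting
    so the fragments can be spread across ranks, regardless of size."""
    if distributed_pin:
        return True          # aggressive: split well below the threshold
    return num_dentries > MDS_BAL_SPLIT_SIZE

# /home with 550 entries: no split without the pin, split with it.
print(should_split(550, distributed_pin=False))  # False
print(should_split(550, distributed_pin=True))   # True
```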

> I'm not sure why that happened, especially when there are no explicit
> pins set for sub-directories. Maybe Patrick has an explanation.

The answer is very embarrassing, I simply didn't read far enough down in the documentation to see that I need to set a flag to true to enable that feature. My bad.
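In case it helps others following the thread: the flag in question is presumably the mds_export_ephemeral_distributed MDS setting, used together with the ceph.dir.pin.distributed vxattr. A sketch against a mounted CephFS (the mount path and rank number are examples):

```shell
# Allow the MDS to honour distributed ephemeral pins (assumed to be the
# flag referred to above; the default differs between releases):
ceph config set mds mds_export_ephemeral_distributed true

# Ask for distributed ephemeral pinning of /home's immediate children:
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home

# For comparison, a manual (explicit) pin of one child to rank 2:
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/home/a
```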

However, this was probably not in vain. After manually pinning directories, I'm making some very interesting workload-performance observations that I would like to share with you; they might help improve load distribution in general. To this end, it would be great to have a vague idea of how manually pinning the immediate children of /home differs from the version where fragments are pinned. I guess I pinned sub-dirs belonging to the same fragment to different ranks, something that will not happen with either of the distributed pin modes.

Since I control the dir pins myself, I can play around with pins and do some manual load balancing. While doing so, I discovered two fundamentally different workloads on our cluster: one that works really well with how an MDS caches data, and one that leads to one-access items being promoted and evicted all the time, with the cache memory allocation constantly at its limit. I'm trying to collect a bit more data before reporting back to this thread.

Thanks to both of you for your help, patience and explanations!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Venky Shankar <vshankar@xxxxxxxxxx>
Sent: 30 November 2022 07:45
To: Frank Schilder
Cc: Reed Dier; ceph-users; Patrick Donnelly
Subject: Re:  Re: MDS stuck ops

Hi Frank,

CC Patrick.

On Tue, Nov 29, 2022 at 8:58 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Venky,
>
> thanks for taking the time. I'm afraid I still don't get the difference. Maybe the ceph dev terminology means something else than what I use. Let's look at this statement, I think it summarises my misery quite well:
>
> > It's an implementation difference. In octopus, each child dir (direct
> > descendent of the ephemeral pinned directory) is pinned to a target
> > MDS based on the hash of its (child dir) inode number. From pacific
> > onwards, the dirfrags are distributed across ranks. This limits the
> > number of subtrees.
>
> Let's say we have /home/{a..c} and I enable ephemeral pinning on /home. Let's also say that each of /home/{a..c} have a number of directory fragments, maybe somewhere deeper down in the hierarchy.

I think this is where there is some misunderstanding in terminology. A
directory in the Ceph file system undergoes fragmentation when the
number of directory entries (file + dir count under the directory)
exceeds a certain threshold. When this happens, the MDS "splits" the
directory into fragments (dirfrags). At the object level, the MDS
creates multiple objects for the directory, each storing a subset of
the directory entries. E.g., the objects for a split directory
(0x10000000000 is the inode number):

10000000000.00000000
10000000000.02000000
10000000000.03600000
10000000000.03400000
10000000000.02c00000
10000000000.02800000

The directory entries are stored as omap entries (key-value pairs) in
each object.
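To make the naming concrete (an illustrative helper, not Ceph's internal encoding; only the <inode hex>.<frag hex> object-name shape is taken from the listing above):

```python
def frag_object_name(ino: int, frag_value: int) -> str:
    """Format a dirfrag's RADOS object name as <inode hex>.<frag hex>.

    The 8-hex-digit suffix identifies the fragment; an unsplit directory
    has the single fragment 0x00000000."""
    return f"{ino:x}.{frag_value:08x}"

# Reproduce two of the names from the listing above:
print(frag_object_name(0x10000000000, 0x00000000))  # 10000000000.00000000
print(frag_object_name(0x10000000000, 0x02000000))  # 10000000000.02000000
```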

> As far as I understand it, ephemeral distributed pinning means that a static pin based on a hash function is assigned to each of /home/{a..c}, which, in turn, is then inherited by all of their child directories. Meaning that all directories under /home/a/ have the same effective static pin as /home/a and likewise for /home/b/... and /home/c/...

That's right - /home/a can be pinned to rank 1, /home/b to rank 2,
/home/c to rank 0. Unless there is an explicit pin for (say)
/home/a/<>/<>/<>/xyz, this sub-directory should be pinned to where
/home/a is currently pinned.

>
> To me, this implies that any directory fragment that is a descendent of /home/a is also pinned to the same MDS as /home/a.

> I really don't understand what the difference between "each child dir (direct descendent of the ephemeral pinned directory) is pinned to a target MDS" (octopus) and "the dirfrags are distributed across ranks" (pacific) is. In other words, if /home/a is assigned a rank pin and all of its descendants inherit this rank pin, how can any directory fragment of (a descendant of) /home/a end up on an MDS that is different than the one assigned to /home/a?

Now that the term "dirfrag" is hopefully clear, let's say a directory
has a distributed ephemeral pin set and has 100 sub-directories. In
octopus, the MDS treats each of these 100 directories as a subtree and
distributes (pins) them across MDSs. In pacific (and later releases),
the MDS instead distributes each "dirfrag" across MDSs. Note that this
limits the number of subtrees that get created: since each dirfrag
holds a subset of the directory entries, the number of subtrees that
the MDS has to track is far smaller with this approach. Also note that
the MDS is very aggressive in fragmenting a directory when a
distributed ephemeral pin is set - well before the file + dir
threshold is reached.
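A toy model of that bookkeeping difference (the function names and fragment count are assumptions for illustration, not Ceph internals):

```python
def octopus_subtrees(num_children: int) -> int:
    """Octopus: every immediate child becomes its own subtree,
    individually pinned by a hash of its inode number."""
    return num_children

def pacific_subtrees(num_frags: int) -> int:
    """Pacific onwards: the pinned directory's dirfrags are the
    subtrees, each distributed to a rank."""
    return num_frags

# 100 sub-directories, with the parent aggressively split into e.g. 8 frags:
print(octopus_subtrees(100))  # 100 subtrees for the MDS to track
print(pacific_subtrees(8))    # 8 subtrees
```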

>
> What I observed is that /home/a/.../xyz and /home/a/.../uvw ended up on different ranks, and none of the descriptions I have seen so far explains why this is expected. All explanations I have seen state that these should be on the same MDS in both octopus and pacific.

I'm not sure why that happened, especially when there are no explicit
pins set for sub-directories. Maybe Patrick has an explanation.

>
> It would be great if you could help me out here. Maybe it really is just terminology?
>
> Thanks a lot for your time again!

HTH.

> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Venky Shankar <vshankar@xxxxxxxxxx>
> Sent: 29 November 2022 15:54:12
> To: Frank Schilder
> Cc: Reed Dier; ceph-users
> Subject: Re:  Re: MDS stuck ops
>
> Hi Frank,
>
> On Tue, Nov 29, 2022 at 5:38 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi Venky,
> >
> > maybe you can help me clarify the situation a bit. I don't understand the difference between the two pinning implementations you describe in your reply, and I also don't see any difference in meaning between the octopus and quincy documentation; the difference is just in wording. Both texts state that "all of a directory’s immediate children should be ephemerally pinned" (octopus) and "This has the effect of distributing immediate children across a range of MDS ranks" (quincy).
> >
> > To me, both mean that, if I enable distributed ephemeral pinning on /home, then for every child /home/X of home it follows that /home/X and any directory under /home/X/ are pinned to the same MDS rank. Meaning their information in cache exists on this rank only and no other MDS is serving requests for any of these directories.
> >
> > Is there something wrong with this interpretation?
>
> Distributed ephemeral pins will distribute immediate children across a
> range of MDS ranks - /home/X might be on rank 1, /home/Y on rank 2,
> /home/Z on rank 0, and so on.
>
> >
> > I tried it with octopus and the cache for directories under /home/X/ was all over the place. Nothing was pinned to a single rank and on top of that the number of sub-trees was extremely unevenly assigned and excessively large. After I set an explicit pin on every child /home/X of /home, only then was all cache information about all subdirs of /home/X/ handled by the MDS I pinned it to.
>
> The directories (children) are spread across MDSs based on the
> (consistent) hash of its inode number. The distribution should be
> uniform across ranks.
>
> >
> > What should the result of distributed ephemeral pinning actually be when set on /home?
> > What would be different between octopus and quincy?
>
> It's an implementation difference. In octopus, each child dir (direct
> descendent of the ephemeral pinned directory) is pinned to a target
> MDS based on the hash of its (child dir) inode number. From pacific
> onwards, the dirfrags are distributed across ranks. This limits the
> number of subtrees.
>
> > Is the documentation (for octopus) misleading or does the implementation not match documentation?
>
> I think the docs are fine - quincy docs do mention that the directory
> fragments are distributed while the octopus docs do not. I agree, the
> wordings are a bit subtle.
>
> >
> > Thanks for any insight!
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Venky Shankar <vshankar@xxxxxxxxxx>
> > Sent: 29 November 2022 10:09:21
> > To: Frank Schilder
> > Cc: Reed Dier; ceph-users
> > Subject: Re:  Re: MDS stuck ops
> >
> > On Tue, Nov 29, 2022 at 1:42 PM Frank Schilder <frans@xxxxxx> wrote:
> > >
> > > Hi Venky.
> > >
> > > > You most likely ran into performance issues with distributed ephemeral
> > > > pins with octopus. It'd be nice to try out one of the latest releases
> > > > for this.
> > >
> > > I run into the problem that distributed ephemeral pinning seems not actually implemented in octopus. This mode didn't pin anything, see also the recent conversation with Patrick:
> >
> > Distributed ephemeral pins used to distribute inodes under a directory
> > amongst MDSs, which had scalability issues due to the sheer number of
> > subtrees. This was changed to distribute dirfrags and I think those
> > changes were not in octopus.
> >
> > >
> > > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YEB34F5SREAOOMATOKC6NO3G2GVCSOOZ
> > >
> > > I sent him a couple of dumps, but am not sure if he is doing anything with it. I wrote a small script to do the distributed pinning by hand and it solved all sorts of problems.
> >
> > Distributing dirfrags solved a lot of scalability issues and those
> > changes are available in pacific and beyond. We aren't backporting to
> > octopus anymore, so the options are limited.
> >
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> >
> >
> > --
> > Cheers,
> > Venky
> >
>
>
> --
> Cheers,
> Venky
>


--
Cheers,
Venky

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



