Hi Frank,

CC Patrick.

On Tue, Nov 29, 2022 at 8:58 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Venky,
>
> thanks for taking the time. I'm afraid I still don't get the difference. Maybe the ceph dev terminology means something different from what I use. Let's look at this statement, I think it summarises my misery quite well:
>
> > It's an implementation difference. In octopus, each child dir (direct
> > descendant of the ephemeral pinned directory) is pinned to a target
> > MDS based on the hash of its (child dir) inode number. From pacific
> > onwards, the dirfrags are distributed across ranks. This limits the
> > number of subtrees.
>
> Let's say we have /home/{a..c} and I enable ephemeral pinning on /home. Let's also say that each of /home/{a..c} has a number of directory fragments, maybe somewhere deeper down in the hierarchy.

I think this is where there is some misunderstanding in terminology. A
directory in a Ceph file system undergoes fragmentation when the number
of directory entries (files + directories directly under it) exceeds a
certain threshold. When this happens, the MDS "splits" the directory
into fragments (dirfrags). At the object level, the MDS creates
multiple objects for the directory. Each object stores a subset of the
directory entries. E.g., objects for a split directory (0x10000000000
is the inode number):

    10000000000.00000000
    10000000000.02000000
    10000000000.03600000
    10000000000.03400000
    10000000000.02c00000
    10000000000.02800000

The directory entries are stored as omap entries (key-value pairs) in
each object.

> As far as I understand it, ephemeral distributed pinning means that a static pin based on a hash function is assigned to each of /home/{a..c}, which, in turn, is then inherited by all of their child directories. Meaning that all directories under /home/a/ have the same effective static pin as /home/a and likewise for /home/b/... and /home/c/...

That's right - /home/a can be pinned to rank 1, /home/b to rank 2,
/home/c to rank 0.
Unless there is an explicit pin for (say) /home/a/<>/<>/<>/xyz, this
sub-directory should be pinned to wherever /home/a is currently pinned.

>
> To me, this implies that any directory fragment that is a descendant of /home/a is also pinned to the same MDS as /home/a.
> I really don't understand what the difference is between "each child dir (direct descendent of the ephemeral pinned directory) is pinned to a target MDS" (octopus) and "the dirfrags are distributed across ranks" (pacific). In other words, if /home/a is assigned a rank pin and all of its descendants inherit this rank pin, how can any directory fragment of (a descendant of) /home/a end up on an MDS that is different from the one assigned to /home/a?

Now that the term "dirfrag" is hopefully clear, let's say a directory
has a distributed ephemeral pin set and has 100 sub-directories. In
octopus, the MDS treats each of these 100 directories as a subtree and
distributes (pins) them across MDSs. In pacific (and later releases),
the MDS instead distributes each "dirfrag" of that directory across
MDSs. Note that this limits the number of subtrees that get created:
since each dirfrag holds a subset of the directory entries, the number
of subtrees that the MDS has to track is far smaller with this
approach. Also, note that the MDS is very aggressive in fragmenting a
directory when a distributed ephemeral pin is set - well before the
file+dir threshold is reached.

>
> What I observed is that /home/a/.../xyz and /home/a/..../uvw ended up on different ranks and none of the descriptions I have seen so far give an explanation for why this is expected. All explanations I have seen state that these should be on the same MDS in both octopus and pacific.

I'm not sure why that happened, especially when there are no explicit
pins set for sub-directories. Maybe Patrick has an explanation.

>
> It would be great if you could help me out here. Maybe it really is just terminology?
>
> Thanks a lot for your time again!

HTH.
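To make the octopus vs. pacific difference described above more concrete, here is a toy Python sketch. It is not the MDS code: SHA-256 stands in for whatever hash the MDS really uses, the function names (toy_rank, octopus_style, pacific_style) are hypothetical, the rule that assigns a child to a dirfrag is invented for illustration, and the choice of 3 ranks and 8 dirfrags is arbitrary. It only shows how the number of tracked subtrees differs between the two schemes.

```python
import hashlib


def toy_rank(key: str, buckets: int) -> int:
    """Map a key to a bucket with a stable hash (a stand-in, NOT Ceph's hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def octopus_style(child_inodes, num_ranks):
    """Octopus-like model: every immediate child directory is its own
    subtree, placed on a rank by hashing the child's inode number."""
    return {ino: toy_rank(f"ino:{ino:x}", num_ranks) for ino in child_inodes}


def pacific_style(child_inodes, num_ranks, frag_bits):
    """Pacific-like model: the pinned directory is split into 2**frag_bits
    dirfrags, each dirfrag is one subtree, and a child follows the dirfrag
    that holds its directory entry."""
    num_frags = 1 << frag_bits
    frag_rank = {frag: toy_rank(f"frag:{frag}", num_ranks)
                 for frag in range(num_frags)}
    child_rank = {}
    for ino in child_inodes:
        # Assumed toy rule: pick the dirfrag from a hash of the inode number.
        frag = toy_rank(f"dentry:{ino:x}", num_frags)
        child_rank[ino] = frag_rank[frag]
    return frag_rank, child_rank


if __name__ == "__main__":
    # 100 hypothetical sub-directories of the ephemerally pinned directory.
    children = [0x10000000000 + i for i in range(100)]
    old = octopus_style(children, num_ranks=3)
    frag_rank, new = pacific_style(children, num_ranks=3, frag_bits=3)
    print("octopus-style subtrees to track:", len(old))
    print("pacific-style subtrees to track:", len(frag_rank))
```

With 100 sub-directories, the octopus-style model tracks 100 subtrees while the pacific-style model tracks only 8 (one per dirfrag), and every child still ends up on exactly one rank in both models.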
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Venky Shankar <vshankar@xxxxxxxxxx>
> Sent: 29 November 2022 15:54:12
> To: Frank Schilder
> Cc: Reed Dier; ceph-users
> Subject: Re: Re: MDS stuck ops
>
> Hi Frank,
>
> On Tue, Nov 29, 2022 at 5:38 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi Venky,
> >
> > maybe you can help me clarify the situation a bit. I don't understand the difference between the two pinning implementations you describe in your reply and I also don't see any difference in meaning in the documentation between octopus and quincy, the difference is just in wording. Both texts state that "all of a directory’s immediate children should be ephemerally pinned" (octopus) and "This has the effect of distributing immediate children across a range of MDS ranks" (quincy).
> >
> > To me, both mean that, if I enable distributed ephemeral pinning on /home, then for every child /home/X of /home it follows that /home/X and any directory under /home/X/ are pinned to the same MDS rank. Meaning their information in cache exists on this rank only and no other MDS is serving requests for any of these directories.
> >
> > Is there something wrong with this interpretation?
>
> Distributed ephemeral pins will distribute immediate children across a
> range of MDS ranks - /home/X might be on rank 1, /home/Y on rank 2,
> /home/Z on rank 0, and so on.
>
> >
> > I tried it with octopus and the cache for directories under /home/X/ was all over the place. Nothing was pinned to a single rank and on top of that the number of sub-trees was extremely unevenly assigned and excessively large. After I set an explicit pin on every child /home/X of /home, only then was all cache information about all subdirs of /home/X/ handled by the MDS I pinned it to.
>
> The directories (children) are spread across MDSs based on the
> (consistent) hash of their inode numbers.
> The distribution should be uniform across ranks.
>
> >
> > What should the result of distributed ephemeral pinning actually be when set on /home?
> > What would be different between octopus and quincy?
>
> It's an implementation difference. In octopus, each child dir (direct
> descendant of the ephemeral pinned directory) is pinned to a target
> MDS based on the hash of its (child dir) inode number. From pacific
> onwards, the dirfrags are distributed across ranks. This limits the
> number of subtrees.
>
> > Is the documentation (for octopus) misleading or does the implementation not match the documentation?
>
> I think the docs are fine - the quincy docs do mention that the
> directory fragments are distributed while the octopus docs do not. I
> agree, the difference in wording is a bit subtle.
>
> >
> > Thanks for any insight!
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Venky Shankar <vshankar@xxxxxxxxxx>
> > Sent: 29 November 2022 10:09:21
> > To: Frank Schilder
> > Cc: Reed Dier; ceph-users
> > Subject: Re: Re: MDS stuck ops
> >
> > On Tue, Nov 29, 2022 at 1:42 PM Frank Schilder <frans@xxxxxx> wrote:
> > >
> > > Hi Venky.
> > >
> > > > You most likely ran into performance issues with distributed ephemeral
> > > > pins with octopus. It'd be nice to try out one of the latest releases
> > > > for this.
> > >
> > > I ran into the problem that distributed ephemeral pinning seems not to be actually implemented in octopus. This mode didn't pin anything, see also the recent conversation with Patrick:
> >
> > Distributed ephemeral pins used to distribute inodes under a directory
> > amongst MDSs, which had scalability issues due to the sheer number of
> > subtrees. This was changed to distribute dirfrags and I think those
> > changes were not in octopus.
> >
> > >
> > > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YEB34F5SREAOOMATOKC6NO3G2GVCSOOZ
> > >
> > > I sent him a couple of dumps, but am not sure if he is doing anything with it. I wrote a small script to do the distributed pinning by hand and it solved all sorts of problems.
> >
> > Distributing dirfrags solved a lot of scalability issues and those
> > changes are available in pacific and beyond. We aren't backporting to
> > octopus anymore, so the options are limited.
> >
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> >
> >
> > --
> > Cheers,
> > Venky
> >
>
>
> --
> Cheers,
> Venky
>

--
Cheers,
Venky