On Fri, 2021-06-04 at 07:57 +0800, Ian Kent wrote:
> On Thu, 2021-06-03 at 10:15 +0800, Ian Kent wrote:
> > On Wed, 2021-06-02 at 18:57 +0800, Ian Kent wrote:
> > > On Wed, 2021-06-02 at 10:58 +0200, Miklos Szeredi wrote:
> > > > On Wed, 2 Jun 2021 at 05:44, Ian Kent <raven@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, 2021-06-01 at 14:41 +0200, Miklos Szeredi wrote:
> > > > > > On Fri, 28 May 2021 at 08:34, Ian Kent <raven@xxxxxxxxxx>
> > > > > > wrote:
> > > > > > >
> > > > > > > If there are many lookups for non-existent paths these negative lookups
> > > > > > > can lead to a lot of overhead during path walks.
> > > > > > >
> > > > > > > The VFS allows dentries to be created as negative and hashed, and caches
> > > > > > > them so they can be used to reduce the fairly high overhead alloc/free
> > > > > > > cycle that occurs during these lookups.
> > > > > > >
> > > > > > Obviously there's a cost associated with negative caching too. For
> > > > > > normal filesystems it's trivially worth that cost, but in case of
> > > > > > kernfs, not sure...
> > > > > >
> > > > > > Can "fairly high" be somewhat substantiated with a microbenchmark for
> > > > > > negative lookups?
> > > > >
> > > > > Well, maybe, but anything we do for a benchmark would be totally
> > > > > artificial.
> > > > >
> > > > > The reason I added this is because I saw appreciable contention
> > > > > on the dentry alloc path in one case I saw.
> > > >
> > > > If multiple tasks are trying to look up the same negative dentry in
> > > > parallel, then there will be contention on the parent inode lock.
> > > > Was this the issue? This could easily be reproduced with an
> > > > artificial benchmark.
> > >
> > > Not that I remember, I'll need to dig up the sysrq dumps to have a
> > > look and get back to you.
> >
> > After doing that though I could grab Fox Chen's reproducer and give
> > it varying sysfs paths as well as some percentage of non-existent
> > sysfs paths and see what I get ...
> >
> > That should give it a more realistic usage profile and, if I can
> > get the percentage of non-existent paths right, demonstrate that
> > case as well ... but nothing is easy, so we'll have to wait and
> > see, ;)
>
> Ok, so I grabbed Fox's benchmark repo and used a non-existent path
> to check the negative dentry contention.
>
> I've taken the baseline readings and the contention I see is the
> same as I originally saw. It's with d_alloc_parallel() on lockref.
>
> While I haven't run the patched check I'm pretty sure that using
> dget_parent() and taking a snapshot will move the contention to
> that. So if I do retain the negative dentry caching change I would
> need to use the dentry seq lock for it to be useful.
>
> Thoughts Miklos, anyone?

Mmm ... never mind, I'd still need to take a snapshot anyway and
dget_parent() looks lightweight if there's no conflict. I will need
to test it.
> > >
> > > > > > >
> > > > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > > > > index 4c69e2af82dac..5151c712f06f5 100644
> > > > > > > --- a/fs/kernfs/dir.c
> > > > > > > +++ b/fs/kernfs/dir.c
> > > > > > > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
> > > > > > >         if (flags & LOOKUP_RCU)
> > > > > > >                 return -ECHILD;
> > > > > > >
> > > > > > > -       /* Always perform fresh lookup for negatives */
> > > > > > > -       if (d_really_is_negative(dentry))
> > > > > > > -               goto out_bad_unlocked;
> > > > > > > +       mutex_lock(&kernfs_mutex);
> > > > > > >
> > > > > > >         kn = kernfs_dentry_node(dentry);
> > > > > > > -       mutex_lock(&kernfs_mutex);
> > > > > > > +
> > > > > > > +       /* Negative hashed dentry? */
> > > > > > > +       if (!kn) {
> > > > > > > +               struct kernfs_node *parent;
> > > > > > > +
> > > > > > > +               /* If the kernfs node can be found this is a stale negative
> > > > > > > +                * hashed dentry so it must be discarded and the lookup redone.
> > > > > > > +                */
> > > > > > > +               parent = kernfs_dentry_node(dentry->d_parent);
> > > > > > >
> > > > > > This doesn't look safe WRT a racing sys_rename(). In this case
> > > > > > d_move() is called only with parent inode locked, but not with
> > > > > > kernfs_mutex while ->d_revalidate() may not have parent inode locked.
> > > > > > After d_move() the old parent dentry can be freed, resulting in use
> > > > > > after free. Easily fixed by dget_parent().
> > > > >
> > > > > Umm ... I'll need some more explanation here ...
> > > > >
> > > > > We are in ref-walk mode so the parent dentry isn't going
> > > > > away.
> > > >
> > > > The parent that was used to lookup the dentry in __d_lookup() isn't
> > > > going away.
> > > > But it's not necessarily equal to dentry->d_parent anymore.
> > > >
> > > > > And this is a negative dentry so rename is going to bail out
> > > > > with ENOENT way early.
> > > >
> > > > You are right. But note that negative dentry in question could be the
> > > > target of a rename. Current implementation doesn't switch the
> > > > target's parent or name, but this wasn't always the case (commit
> > > > 076515fc9267 ("make non-exchanging __d_move() copy ->d_parent rather
> > > > than swap them")), so a backport of this patch could become incorrect
> > > > on old enough kernels.
> > >
> > > Right, that __lookup_hash() will find the negative target.
> > >
> > > >
> > > > So I still think using dget_parent() is the correct way to do
> > > > this.
> > >
> > > The rename code does my head in, ;)
> > >
> > > The dget_parent() would ensure we had an up to date parent so
> > > yes, that would be the right thing to do regardless.
> > >
> > > But now I'm not sure that will be sufficient for kernfs. I'm still
> > > thinking about it.
> > >
> > > I'm wondering if there's a missing check in there to account for
> > > what happens with revalidate after ->rename() but before move.
> > > There's already a kernfs node check in there so it's probably ok
> > > ...
> > >
> > > >
> > > > > >
> > > > > > > +               if (parent) {
> > > > > > > +                       const void *ns = NULL;
> > > > > > > +
> > > > > > > +                       if (kernfs_ns_enabled(parent))
> > > > > > > +                               ns = kernfs_info(dentry->d_sb)->ns;
> > > > > > > +                       kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > > > > > >
> > > > > > Same thing with d_name. There's
> > > > > > take_dentry_name_snapshot()/release_dentry_name_snapshot() to properly
> > > > > > take care of that.
> > > > >
> > > > > I don't see that problem either, due to the dentry being negative,
> > > > > but please explain what you're seeing here.
> > > >
> > > > Yeah. Negative dentries' names weren't always stable, but that was a
> > > > long time ago (commit 8d85b4845a66 ("Allow sharing external names
> > > > after __d_move()")).
> > >
> > > Right, I'll make that change too.
> > >
> > > >
> > > > Thanks,
> > > > Miklos
> > >
> > >
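
The revalidation rule in the quoted patch (a hashed negative dentry is
stale once the kernfs node can be found under its parent) can be
modelled in user space. What follows is only a rough sketch under
stated assumptions: struct dir, struct node, struct neg_entry,
dir_find(), dir_add() and revalidate_negative() are hypothetical
stand-ins, not kernel APIs, and there is no locking or reference
counting. The private name copy in struct neg_entry loosely mirrors
why the thread suggests taking a name snapshot before the lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-ins; these are NOT kernel structures. */
struct node {
	const char *name;
	struct node *next;	/* sibling list under one parent */
};

struct dir {
	struct node *children;
};

/* Models a cached negative lookup: "name was not found in parent". */
struct neg_entry {
	struct dir *parent;
	char name[64];		/* private copy, so later renames cannot change it */
};

/* Linear search; a stand-in for the role kernfs_find_ns() plays in the patch. */
static struct node *dir_find(const struct dir *d, const char *name)
{
	for (struct node *n = d->children; n; n = n->next)
		if (strcmp(n->name, name) == 0)
			return n;
	return NULL;
}

/* Push a node onto the front of the directory's child list. */
static void dir_add(struct dir *d, struct node *n)
{
	n->next = d->children;
	d->children = n;
}

/*
 * Returns 1 if the cached negative is still valid (name still absent),
 * 0 if it is stale and the lookup would have to be redone.
 */
static int revalidate_negative(const struct neg_entry *e)
{
	assert(e != NULL && e->parent != NULL);
	return dir_find(e->parent, e->name) == NULL;
}
```

While the directory has no child called "missing",
revalidate_negative() returns 1 and the cached negative can be kept.
Once dir_add() links in a node with that name it returns 0, which
corresponds to discarding the dentry and redoing the lookup.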