Ian Kent <raven@xxxxxxxxxx> writes:

> On Fri, 2021-06-04 at 09:28 -0500, Eric W. Biederman wrote:
>> Ian Kent <raven@xxxxxxxxxx> writes:
>>
>> > On Thu, 2021-06-03 at 17:02 -0500, Eric W. Biederman wrote:
>> > > Miklos Szeredi <miklos@xxxxxxxxxx> writes:
>> > >
>> > > > On Thu, 3 Jun 2021 at 19:26, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>> > > > >
>> > > > > Ian Kent <raven@xxxxxxxxxx> writes:
>> > > > >
>> > > > > > If there are many lookups for non-existent paths these negative
>> > > > > > lookups can lead to a lot of overhead during path walks.
>> > > > > >
>> > > > > > The VFS allows dentries to be created as negative and hashed,
>> > > > > > and caches them so they can be used to reduce the fairly high
>> > > > > > overhead alloc/free cycle that occurs during these lookups.
>> > > > > >
>> > > > > > Signed-off-by: Ian Kent <raven@xxxxxxxxxx>
>> > > > > > ---
>> > > > > >  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------------------
>> > > > > >  1 file changed, 33 insertions(+), 22 deletions(-)
>> > > > > >
>> > > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> > > > > > index 4c69e2af82dac..5151c712f06f5 100644
>> > > > > > --- a/fs/kernfs/dir.c
>> > > > > > +++ b/fs/kernfs/dir.c
>> > > > > > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>> > > > > >  	if (flags & LOOKUP_RCU)
>> > > > > >  		return -ECHILD;
>> > > > > >
>> > > > > > -	/* Always perform fresh lookup for negatives */
>> > > > > > -	if (d_really_is_negative(dentry))
>> > > > > > -		goto out_bad_unlocked;
>> > > > > > +	mutex_lock(&kernfs_mutex);
>> > > > > >
>> > > > > >  	kn = kernfs_dentry_node(dentry);
>> > > > > > -	mutex_lock(&kernfs_mutex);
>> > > > >
>> > > > > Why bring kernfs_dentry_node inside the mutex?
>> > > > >
>> > > > > The inode lock of the parent should protect negative to positive
>> > > > > transitions, not the kernfs_mutex.  So moving the code inside
>> > > > > the mutex looks unnecessary and confusing.
>> > > >
>> > > > Except that d_revalidate() may or may not be called with the
>> > > > parent lock held.
>> >
>> > Bringing the kernfs_dentry_node() inside the mutex is probably
>> > wasteful, as you say.  Oddly, the reason I did it is that it makes
>> > conceptual sense to me since the kernfs node is being grabbed.  But
>> > a concurrent unlink probably isn't possible, so it is not necessary.
>> >
>> > Since you feel strongly about it I can change it.
>> >
>> > >
>> > > I grant that this works because kernfs_iop_lookup today holds
>> > > kernfs_mutex over d_splice_alias.
>> >
>> > Changing that will require some thought but your points about
>> > maintainability are well taken.
>> >
>> > >
>> > > The problem is that the kernfs_mutex should only be protecting the
>> > > kernfs data structures, not the vfs data structures.
>> > >
>> > > Reading through the code history, that looks like a holdover from
>> > > when sysfs lived in the dcache, before it was reimplemented as a
>> > > distributed file system.  So it was probably a complete oversight
>> > > and something that did not matter.
>> > >
>> > > The big problem is that if the code starts depending on the
>> > > kernfs_mutex (or the kernfs_rwsem) to provide semantics that the
>> > > rest of the filesystems do not, the code will diverge from the
>> > > rest of the filesystems and maintenance will become much more
>> > > difficult.
>> > >
>> > > Diverging from other filesystems and becoming a maintenance pain
>> > > has already been seen once in the life of sysfs, and I don't think
>> > > we want to go back there.
>> > >
>> > > Further extending the scope of the lock, when the problem is that
>> > > the locking is causing problems, seems like the opposite of the
>> > > direction we want the code to grow.
>> > >
>> > > I really suspect all we want kernfs_dop_revalidate doing for
>> > > negative dentries is something as simple as comparing the
>> > > timestamp of the negative dentry to the timestamp of the parent
>> > > dentry, and if the timestamp has changed, perform the lookup.
>> > > That is roughly what nfs does today with negative dentries.
>> > >
>> > > The dentry cache will always lag the kernfs_node data structures,
>> > > and that is fundamental.  We should take advantage of that to make
>> > > the code as simple and as fast as we can, not perform lots of work
>> > > that creates overhead.
>> > >
>> > > Plus the kernfs data structures should not change much, so I
>> > > expect there will be effectively zero penalty in always performing
>> > > the lookup of a negative dentry when the directory itself has
>> > > changed.
>> >
>> > This sounds good to me.
>> >
>> > In fact this approach should be able to be used to resolve the
>> > potential race Miklos pointed out in a much simpler way, not to
>> > mention the revalidate simplification itself.
>> >
>> > But isn't knowing whether the directory has changed harder to
>> > do than checking a time stamp?
>> >
>> > Look at kernfs_refresh_inode() and its callers for example.
>> >
>> > I suspect that would require bringing back the series patch to use
>> > a generation number to identify directory changes (and also getting
>> > rid of the search in revalidate).
>>
>> In essence it is as simple as looking at a sequence number or a
>> timestamp to detect that the directory has changed.
>
> Yes, both Miklos and Al suggested using a simple revision to detect
> changes to the parent.
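[Editor's note: the revision-based check being converged on above can be
modelled outside the kernel.  The sketch below is illustrative only; the
names (`kn_dir`, `neg_dentry`, `dir_changed`, `neg_dentry_revalidate`)
are hypothetical stand-ins for the real kernfs fields and helpers, and
all locking/barrier concerns are omitted.]

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the scheme under discussion: each directory
 * keeps a revision counter that is bumped on every create/remove, and
 * each cached negative dentry records the revision of its parent at
 * the time it was hashed.
 */
struct kn_dir {
	uint64_t revision;	/* bumped whenever the directory changes */
};

struct neg_dentry {
	struct kn_dir *parent;
	uint64_t revision;	/* parent revision when dentry was cached */
};

/* Called whenever a node is added to or removed from the directory. */
static void dir_changed(struct kn_dir *dir)
{
	dir->revision++;
}

/*
 * Revalidate a negative dentry: it stays valid only while the parent
 * directory is unchanged since the dentry was cached, so the fast path
 * needs no search of the directory's children at all.
 */
static bool neg_dentry_revalidate(const struct neg_dentry *d)
{
	return d->revision == d->parent->revision;
}
```

This is the property Eric describes: a negative dentry can lag the
kernfs_node tree safely, because any directory change invalidates it
and forces a fresh lookup.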
> I did that early on and I don't think I grokked what Al recommended,
> and I ended up with something more complex than was needed.  So I
> dropped it because I wanted to keep the changes to a minimum.
>
> But a quick test, bringing that patch back and getting rid of the
> search in revalidate, works well.  It's as effective at eliminating
> the contention I saw with d_alloc_parallel() for the case of a lot of
> deterministic accesses to the same non-existent file as the racy
> search method I had there, perhaps a bit better, and it's certainly
> more straightforward.
>
>>
>> In practice there are always details that make things more
>> complicated.
>>
>> I was actually wondering if the approach should be to have a seqlock
>> around an individual directory's rbtree.  I think that would give a
>> lot of potential for rcu-style optimization during lookups.
>
> Yeah, it's tempting, but another constraint I had is to not increase
> the size of the kernfs_node struct (Greg and Tejun), and there's a
> hole in the node union variant kernfs_elem_dir at least big enough
> for sizeof(pointer), so I can put the revision there.  And, given the
> simplification in revalidate, as well as that extra code being pretty
> straightforward itself, it's not too bad from the minimal-change
> POV.
>
> So I'd like to go with using a revision for now.

No objection from me.

Eric
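[Editor's note: Eric's seqlock suggestion can also be illustrated with a
minimal single-threaded model: readers sample a sequence counter, do a
lockless lookup, and retry if a writer ran concurrently.  This is a
sketch of the idea only; real kernel code would use `seqcount_t` with
the proper memory barriers, and the names below are invented for the
example.]

```c
#include <stdint.h>

/*
 * Minimal model of a seqcount protecting a directory's rbtree:
 * writers bump the counter to odd before modifying and back to even
 * after, and readers retry whenever the counter was odd or changed
 * across the lookup.
 */
struct dir_seq {
	uint64_t seq;	/* even = stable, odd = write in progress */
};

static uint64_t read_begin(const struct dir_seq *d)
{
	return d->seq;
}

static int read_retry(const struct dir_seq *d, uint64_t start)
{
	/* Retry if a write was in flight, or completed meanwhile. */
	return (start & 1) || d->seq != start;
}

static void write_begin(struct dir_seq *d) { d->seq++; }	/* now odd */
static void write_end(struct dir_seq *d)   { d->seq++; }	/* even again */
```

A reader would loop: `do { s = read_begin(&d); ...lookup... } while
(read_retry(&d, s));`, which is what makes rcu-style lockless lookups
attractive here.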