On Sun, 28 Jan 2024 12:53:31 -0800 Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote: > > > > I have to understand how the dentry lookup works. Basically, when the > > ei gets deleted, it can't be freed until all dentries it references > > (including its children) are no longer being accessed. Does that lookup > > get called only when a dentry with the name doesn't already exist? > > Dentry lookup gets called with the parent inode locked for reading, so > a lookup can happen in parallel with readdir and other dentry lookup. > > BUT. > > Each dentry is also "self-serializing", so you will never see a lookup > on the same name (in the same directory) concurrently. The above is what I wanted to know. > > The implementation is very much internal to the VFS layer, and it's > all kinds of nasty, with a new potential lookup waiting for the old > one, verifying that the old one is still usable, and maybe repeating > it all until we find a successful previous lookup or we're the only > dentry remaining. > > It's nasty code that is very much in the "Al Viro" camp, but the point > is that any normal filesystem should treat lookups as being concurrent > with non-creation events, but not concurrently multiples. > > There *is* some common code with "atomic_open()", where filesystems > that implement that then want to know if it's the *first* lookup, or a > use of a previously looked up dentry, and they'll use the > "d_in_lookup()" thing to determine that. So this whole "keep track of > which dentries are *currently* being looked up is actually exposed, > but any normal filesystem should never care. > > But if you think you have that issue (tracefs does not), you really > want to talk to Al extensively. > > > That is, can I assume that eventfs_root_lookup() is only called when > > the VFS file system could not find an existing dentry and it has to > > create one? > > Correct. For any _particular_ name, you should think of lookup as serialized. > > > If that's the case, then I can remove the ei->dentry and just add a ref > > counter that it was accessed. Then the final dput() should call > > eventfs_set_ei_status_free() (I hate that name and need to change it), > > and if the ei->is_freed is set, it can free the ei. > > Note that the final 'dput()' will happen *after* the dentry has been > removed, so what can happen is > > lookup("name", d1); > ... lookup successful, dentry is used .. > ... dentry at some point has no more users .. > ... memory pressure prunes unused dentries .. > ... dentry gets unhashed and is no longer visible .. > lookup("name", d2); > ... new dentry is created .. > final dput(d1); > .. old dentry - that wasn't accessible any more is freed .. Actually I was mistaken. I'm looking at the final iput() not dput(). > > and this is actually one of the *reasons* that virtual filesystems > must not try to cache dentry pointers in their internal data > structures. Because note how the fuilesystem saw the new lookup(d2) of > the same name *before* it saw the >d_release(d1) of the old dentry. > > And the above is fundamental: we obviously cannot call '->d_release()' > until the old dentry is all dead and buried (lockref_mark_dead() etc), > so pretty much by definition you'll have that ordering being possible. > > It's extremely unlikely, of course. I'll be you'll never hit it in testing. > > So if if you associate some internal data structure with a dentry, > just *what* happens when you haven't been told abotu the old dentry > being dead when the new one happens? > > See why I say that it's fundamentally wrong for a filesystem to try to > track dentries? All the operations that can use a dentry will get one > passed down to them by the VFS layer. The filesystem has no business > trying to remember some dentry from a previous operation, and the > filesystem *will* get it wrong. > > But also note how refcounting works fine. In fact, refcounting is > pretty much the *only* thing that works fine. So what you *should* do > is > > - at lookup(), when you save your filesystem data in "->d_fsdata", > you increment a refcount > > - at ->d_release(), you decrement a refcount > > and now you're fine. Yes, when the above (very very unlikely) > situation happens, you'll temporarily have a refcount incremented > twice, but that's kind of the *point* of refcounts. > > Side note: this is pretty much true of any kernel data structure. If > you have a kernel data structure that isn't just used within one > thread, it must be refcounted. But it'as *doubly* true when you save > references to something that the VFS maintains, because you DO NOT > CONTROL the lifetime of that entity. > > > The eventfs_remove_dir() can free the ei (after SRCU) if it has no > > references, otherwise it needs to wait for the final dput() to do the > > free. > > Honestly, you should just *always* do refcounting. No "free after RCU > delay" as an alternative. Just refcount it. > > Now, the RCU delay may be needed if the lookup of said structure > happens under RCU, but no, saying "I use SRCU to make sure the > lifetime is at least X" is just broken. > > The refcount is what gives the lifetime. Any form of RCU-delaying > should then be purely about non-refcounting RCU lookups that may > happen as the thing is dying (and said lookup should *look* at the > refcount and say "oh, this is dead, I'm not returning this". > > > I think the ei->dentry was required for the dir wrapper logic that we > > removed. > > I think all of this was due to the bogus readdir that created dentries > willy-nilly and without the required serialization. > > And that was all horribly broken. It wasn't even the above kind of > "really subtle race that you'll never hit in practice" broken. It was > just absolutely broken with readdir and lookup racing on the same name > and creating an unholy dentry mess. Hmm, if I understand the above, I could get rid of keeping around dentry and even remove the eventfs_set_ei_status_free(). I can try something to see if it works. -- Steve