On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote: > > I have to understand how the dentry lookup works. Basically, when the > ei gets deleted, it can't be freed until all dentries it references > (including its children) are no longer being accessed. Does that lookup > get called only when a dentry with the name doesn't already exist? Dentry lookup gets called with the parent inode locked for reading, so a lookup can happen in parallel with readdir and other dentry lookup. BUT. Each dentry is also "self-serializing", so you will never see a lookup on the same name (in the same directory) concurrently. The implementation is very much internal to the VFS layer, and it's all kinds of nasty, with a new potential lookup waiting for the old one, verifying that the old one is still usable, and maybe repeating it all until we find a successful previous lookup or we're the only dentry remaining. It's nasty code that is very much in the "Al Viro" camp, but the point is that any normal filesystem should treat lookups as being concurrent with non-creation events, but not concurrently multiples. There *is* some common code with "atomic_open()", where filesystems that implement that then want to know if it's the *first* lookup, or a use of a previously looked up dentry, and they'll use the "d_in_lookup()" thing to determine that. So this whole "keep track of which dentries are *currently* being looked up is actually exposed, but any normal filesystem should never care. But if you think you have that issue (tracefs does not), you really want to talk to Al extensively. > That is, can I assume that eventfs_root_lookup() is only called when > the VFS file system could not find an existing dentry and it has to > create one? Correct. For any _particular_ name, you should think of lookup as serialized. > If that's the case, then I can remove the ei->dentry and just add a ref > counter that it was accessed. Then the final dput() should call > eventfs_set_ei_status_free() (I hate that name and need to change it), > and if the ei->is_freed is set, it can free the ei. Note that the final 'dput()' will happen *after* the dentry has been removed, so what can happen is lookup("name", d1); ... lookup successful, dentry is used .. ... dentry at some point has no more users .. ... memory pressure prunes unused dentries .. ... dentry gets unhashed and is no longer visible .. lookup("name", d2); ... new dentry is created .. final dput(d1); .. old dentry - that wasn't accessible any more is freed .. and this is actually one of the *reasons* that virtual filesystems must not try to cache dentry pointers in their internal data structures. Because note how the fuilesystem saw the new lookup(d2) of the same name *before* it saw the >d_release(d1) of the old dentry. And the above is fundamental: we obviously cannot call '->d_release()' until the old dentry is all dead and buried (lockref_mark_dead() etc), so pretty much by definition you'll have that ordering being possible. It's extremely unlikely, of course. I'll be you'll never hit it in testing. So if if you associate some internal data structure with a dentry, just *what* happens when you haven't been told abotu the old dentry being dead when the new one happens? See why I say that it's fundamentally wrong for a filesystem to try to track dentries? All the operations that can use a dentry will get one passed down to them by the VFS layer. The filesystem has no business trying to remember some dentry from a previous operation, and the filesystem *will* get it wrong. But also note how refcounting works fine. In fact, refcounting is pretty much the *only* thing that works fine. So what you *should* do is - at lookup(), when you save your filesystem data in "->d_fsdata", you increment a refcount - at ->d_release(), you decrement a refcount and now you're fine. Yes, when the above (very very unlikely) situation happens, you'll temporarily have a refcount incremented twice, but that's kind of the *point* of refcounts. Side note: this is pretty much true of any kernel data structure. If you have a kernel data structure that isn't just used within one thread, it must be refcounted. But it'as *doubly* true when you save references to something that the VFS maintains, because you DO NOT CONTROL the lifetime of that entity. > The eventfs_remove_dir() can free the ei (after SRCU) if it has no > references, otherwise it needs to wait for the final dput() to do the > free. Honestly, you should just *always* do refcounting. No "free after RCU delay" as an alternative. Just refcount it. Now, the RCU delay may be needed if the lookup of said structure happens under RCU, but no, saying "I use SRCU to make sure the lifetime is at least X" is just broken. The refcount is what gives the lifetime. Any form of RCU-delaying should then be purely about non-refcounting RCU lookups that may happen as the thing is dying (and said lookup should *look* at the refcount and say "oh, this is dead, I'm not returning this". > I think the ei->dentry was required for the dir wrapper logic that we > removed. I think all of this was due to the bogus readdir that created dentries willy-nilly and without the required serialization. And that was all horribly broken. It wasn't even the above kind of "really subtle race that you'll never hit in practice" broken. It was just absolutely broken with readdir and lookup racing on the same name and creating an unholy dentry mess. Linus