Al, this is mainly a rambling question for you, but others left on the Cc just FYI.. On Thu, Aug 29, 2013 at 4:42 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > So I fixed the VFS layer instead. With dput() and friends using > lockrefs, the only thing remaining in the hot RCU dentry lookup path > was the nasty __d_rcu_to_refcount() thing in complete_walk(). I > rewrote that to locklessly increment the refcount when it was nonzero, > and get the lock if it was zero, and that all seems to work fine. Today I realized that my new simplified d_rcu_to_refcount() was fundamentally buggy in a really annoying way. I had worried mainly about the races with the dentry being invalidated, and how we must not touch the dentry reference count if it might be dead, because once the dentry has been killed, it cannot be killed again. My lockref_get_or_lock() avoided all that by making sure that it always got the reference on a dentry that was still live, and this everything was fine. If it turns out that the sequence count says that the lookup wasn't valid, we could just dput() that dentry, and everything was fine. And you know what? That's even true. It's true from a dentry standpoint. But today I was - for totally unrelated reasons, just as a result of looking at performance profiles - looking at dput(), and realized that dput() really *can* sleep. Not for any actual dentry reasons, but because the inode associated with the dentry is being released. And the reason I noticed this from the performance profile was that the dput code did if (dentry->d_lockref.count == 1) might_sleep(); where reading that lockref count showed up in the profile. But that test never triggered any of my thinking, and it's racy anyway, and the condition we care about is another race, which means that it not only didn't trigger my thinking, it doesn't trigger in any normal testing either. The fact is, dput() can always sleep, regardless of count, because of the race (ok, the exception being that you actually have multiple references to the same dentry _yourself_, but who does that? Nobody). Anyway, the reason the new d_rcu_to_refcount() is buggy isn't because it does bad things to dentries - the dentry reference counts and lifetimes are perfectly fine. No, the reason it does bad things is that it does an otherwise perfectly fine dput() in a context where it definitely is _not_ fine to sleep. We are in RCU lookup, so we are in a RCU-locked region, we hold the percpu vfsmount_lock spinlock, and in fact in unlazy_walk() we also potentially hold the "fs->lock" spinlock. Ugh, ugh, ugh. I really like the much simplified d_rcu_to_refcount() conceptually, though. So my current plan is to keep it, but to just delay the "dput()" it does until after we've done the "unlock_rcu_walk()". "complete_walk()" is actually quite simple to fix, because it does the unlock_rcu_walk() itself - so it's fairly trivial to just move the dput() to be after it. In preparation for this, I already ended up committing my "dead lockref" dentry code, because that simplifies d_rcu_to_refcount() to the degree that splitting the code up into the callers and then moving the dput() looks like the right thing to do. The problem I see right now is unlazy_walk(). It does *not* do the unlock_rcu_walk() itself for the error case (and it's the error case that needs this), but instead just returns -ECHILD and expects the caller to do it. Even then it happens fairly obscurely in "terminate_walk()". So Al, any ideas? I currently have two: - always do that unlock_rcu_walk() in unlazy_walk(), and exit RCU mode even for the error case (similar to how complete_walk() does it). That makes solving this for unlazy_walk() as straightforward as it is for complete_walk(), and this is, I think, the right thing to do. - add a couple of "to be freed" dentry pointers to the 'nd' and let terminate_walk() do the dput(). This leaves the unlazy_walk() semantics alone. The second one is hacky, but it doesn't change the semantics of unlzay_walk(). I think the current semantics (drop out of RCU mode on success, but not on failure) are nasty, but I wonder if there is some really subtle reason for it. So the hacky second solution has got that whole "no change to odd/subtle semantics" thing going for it.. Comments? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html