On Fri, Mar 15, 2019 at 10:19:39AM +1100, Tobin C. Harding wrote: > On Fri, Mar 15, 2019 at 09:56:32AM +1100, Tobin C. Harding wrote: > > Hi, > > > > I'm not able to understand the locking order in dcache. dcache has been > > around for a while so clearly its right and I'm wrong. > > > > Could someone please explain to me how the locking order commented at > > the top of the file is not violated in the following: > > > > > From top of fs/dcache.c > > > > * If there is an ancestor relationship: > > * dentry->d_parent->...->d_parent->d_lock > > * ... > > * dentry->d_parent->d_lock > > * dentry->d_lock > > > > This is a more simple example > > static struct dentry *dentry_kill(struct dentry *dentry) > __releases(dentry->d_lock) > { > ... > > slow_positive: > spin_unlock(&dentry->d_lock); > spin_lock(&inode->i_lock); > spin_lock(&dentry->d_lock); > parent = lock_parent(dentry); Yes? Why is that a problem? We start with ->d_lock held. We drop it. At that point we are holding a reference to dentry and no locks whatsoever. We know that inode is not going away (we are holding a reference to positive dentry, so its ->d_inode can't change under us, and it contributes to inode refcount). So we can safely grab its ->i_lock. Then we grab (in normal order) ->d_lock. Then, in lock_parent() we check if ->d_parent points to ourselves. ->d_parent is stabilized by ->d_lock, so that check if fine. If it does *not* point to ourselves, we do trylock on its ->d_parent. Also fine - ->d_parent isn't going anywhere and trylock can't lead to deadlocks; that's what "try" in the name is about. If it succeeds, we have * inode->i_lock held * dentry->d_lock held * parent->d_lock held * inode == dentry->d_inode * parent == dentry->d_parent and we are fine, except that the reference we are holding might not be the only existing one anymore (it used to be, but we'd dropped ->d_lock, so additional ones might've appeared, and we need to repeat the retain_dentry()-related checks). If it fails, we call __lock_parent(). Which * grabs RCU lock * drops ->d_lock (now we are not holding ->d_lock on anything). * fetches ->d_parent. Note the READ_ONCE() there - it's *NOT* stable (no ->d_lock held). We can't expect that ->d_parent won't change or that the reference it used to contribute to parent's refcount is there anymore; as the matter of fact, the only thing that prevents outright _freeing_ of the object 'parent' points to is rcu_read_lock() and RCU delay between dropping the last reference and actual freeing of the sucker. rcu_read_lock() is there, though, which makes it safe to grab ->d_lock on 'parent'. That 'parent' might very well have nothing to do with our dentry by now. We can check if it's equal to its ->d_parent, though. dentry->d_parent is *NOT* stable at that point. It might be changing right now. However, the first store to dentry->d_parent making it not equal to parent would have been done under parent->d_lock. And since we are holding parent->d_lock, we won't miss that store. We might miss subsequent ones, but if we observe dentry->d_parent == parent, we know that it's stable. And if we see dentry->d_parent != parent, we know that dentry has moved around and we need to retry anyway. If we do *not* need to retry, we have * inode->i_lock held * parent->d_lock held * dentry->d_parent == parent * dentry->d_inode == inode At that point we can drop rcu_read_lock() - parent can't be freed under us without dentry->d_parent switched from parent to something else, and anyone trying to do that will spin on parent->d_lock. And now we can grab dentry->d_lock - ordering is guaranteed. The only minor twist is that once upon a time __lock_parent() used to have to cope with the possibility of dentry becoming detached while we were retrying; in that case we obviously don't want to grab dentry->d_lock - we are holding it already (as parent->d_lock). These days (since "make non-exchanging __d_move() copy ->d_parent rather than swap them") if IS_ROOT(dentry) is false at some point, it *never* will become true, so this check could've been dropped, along with the checks for __lock_parent() return value possibly being NULL. I'm not saying it's not subtle and nasty - any time you see READ_ONCE() (or rcu_read_lock(), for that matter), expect headache. But trylock is absolutely innocious part here; we are not even retrying it once.