On Fri, Mar 15, 2019 at 10:38:23AM -0700, Eric Biggers wrote: > On Fri, Mar 15, 2019 at 01:50:21AM +0000, Al Viro wrote: > > > > If it fails, we call __lock_parent(). Which > > * grabs RCU lock > > * drops ->d_lock (now we are not holding ->d_lock > > on anything). > > * fetches ->d_parent. Note the READ_ONCE() there - > > it's *NOT* stable (no ->d_lock held). We can't expect > > that ->d_parent won't change or that the reference it used > > to contribute to parent's refcount is there anymore; as > > the matter of fact, the only thing that prevents outright > > _freeing_ of the object 'parent' points to is rcu_read_lock() > > and RCU delay between dropping the last reference and > > actual freeing of the sucker. rcu_read_lock() is there, > > though, which makes it safe to grab ->d_lock on 'parent'. > > > > That 'parent' might very well have nothing to do with our > > dentry by now. We can check if it's equal to its > > ->d_parent, though. dentry->d_parent is *NOT* stable > > at that point. It might be changing right now. > > > > However, the first store to dentry->d_parent making it > > not equal to parent would have been done under parent->d_lock. > > And since we are holding parent->d_lock, we won't miss that > > store. We might miss subsequent ones, but if we observe > > dentry->d_parent == parent, we know that it's stable. And > > if we see dentry->d_parent != parent, we know that dentry > > has moved around and we need to retry anyway. > > Why isn't it necessary to use READ_ONCE(dentry->d_parent) here? > > if (unlikely(parent != dentry->d_parent)) { > > Suppose 'parent' is 0xAAAABBBB, and 'dentry->d_parent' is 0xAAAAAAAA and is > concurrently changed to 0xBBBBBBBB. > > d_parent could be read in two parts, 0xAAAA then 0xBBBB, resulting in it > appearing that d_parent == 0xAAAABBBB == parent. > > Yes it won't really be compiled as that in practice, but I thought the point of > READ_ONCE() is to *guarantee* it's really done right... READ_ONCE does not add any extra warranties of atomicity. Fetches and stores of pointers are atomic, period; if that ever breaks, we are in a very deep trouble all over the place. What's more, spin_lock acts as a compiler barrier and, on SMP, is an ACQUIRE operation. So that second fetch of ->d_parent will happen after we grab parent->d_lock, from everyone's POV. Critical areas for the same spinlock are ordered wrt each other. So we have observed FETCH dentry->d_parent => parent LOCK parent->d_lock FETCH dentry->d_parent => parent All stores to dentry->d_parent are done with ->d_lock held on dentry, old value of dentry->d_parent *and* new value of dentry->d_parent. So the second fetch is ordered wrt all stores making dentry->d_parent change to parent and all stores making it change *from* parent. We might miss some stores changing it from one value other than parent to another such, but the predicate itself is fine and will stay fine until we drop parent->d_lock. Paul, could you comment on that one? The function in question is this: static struct dentry *__lock_parent(struct dentry *dentry) { struct dentry *parent; rcu_read_lock(); spin_unlock(&dentry->d_lock); again: parent = READ_ONCE(dentry->d_parent); spin_lock(&parent->d_lock); /* * We can't blindly lock dentry until we are sure * that we won't violate the locking order. * Any changes of dentry->d_parent must have * been done with parent->d_lock held, so * spin_lock() above is enough of a barrier * for checking if it's still our child. */ if (unlikely(parent != dentry->d_parent)) { spin_unlock(&parent->d_lock); goto again; } rcu_read_unlock(); if (parent != dentry) spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED); else parent = NULL; return parent; } (in fs/dcache.c) and all stores to ->d_parent are guaranteed to be done under ->d_lock on dentry itself and ->d_lock on both old and new values of ->d_parent.