On Thu, Mar 10, 2016 at 04:43:16AM +0000, Al Viro wrote:
> On Thu, Mar 10, 2016 at 03:46:43AM +0000, Drokin, Oleg wrote:
> 
> > > Wait a minute.  If it's hashed, has the right name and the right parent,
> > > why the hell are we calling ->lookup() on a new dentry in the first place?
> > > Why hadn't we simply picked it from dcache?
> > 
> > This is because of the trickery we do in the d_compare.
> > our d_compare looks at the "invalid" flag and if it's set, returns "not matching",
> > triggering the lookup instead of revalidate.
> > This makes revalidate simple and fast.
> > (We used to have a complicated revalidate with a lot of code duplication with
> > lookup in order to be able to query the server and pass all sorts of data there
> > and it was nothing but trouble).
> 
> *Ugh*...  That's really nasty.  We certainly could make d_exact_match()
> accept unhashed ones and make rehashing conditional (NFS doesn't pull
> anything similar, so it won't care), but your ->d_revalidate()
> has exact same problem as ext4_d_revalidate() one mentioned upthread -
> there's no warranty that dentry->d_parent will stay stable.
> 
> We are *NOT* guaranteed locked parent when ->d_revalidate() is called, or
> we would have to lock every damn directory on the way through the pathname
> resolution.  Moreover, ->d_revalidate() really can overlap with rename(2).

PS: there's a reason why e.g. NFS ->d_revalidate() is doing
	if (flags & LOOKUP_RCU) {
		parent = ACCESS_ONCE(dentry->d_parent);
		dir = d_inode_rcu(parent);
		if (!dir)
			return -ECHILD;
	} else {
		parent = dget_parent(dentry);
		dir = d_inode(parent);
	}
and so do other instances.  It does *not* guarantee that parent will remain
the parent through the whole thing (or will still be one by the time
dget_parent() caller gets the return value), but it does guarantee that it
won't get freed under you.
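
To make that distinction concrete, here is a rough userspace sketch of "pinned but not stable". It is not kernel code: struct toy_dentry and every toy_*() helper are made-up stand-ins (for struct dentry, dget_parent(), dput() and the reparenting done by rename(2)), and all locking is left out, so take it only as an illustration of the refcounting argument.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct dentry; every name here is hypothetical. */
struct toy_dentry {
	int refcount;			/* models the dentry reference count */
	struct toy_dentry *parent;	/* models ->d_parent */
};

/* Models dget_parent(): grab whoever is the parent right now, pinned. */
static struct toy_dentry *toy_get_parent(struct toy_dentry *d)
{
	struct toy_dentry *p = d->parent;

	p->refcount++;
	return p;
}

/* Models dput(): drop a reference, return 1 once nothing holds it. */
static int toy_put(struct toy_dentry *d)
{
	return --d->refcount == 0;
}

/* Models the reparenting done by rename(2). */
static void toy_move(struct toy_dentry *d, struct toy_dentry *new_parent)
{
	struct toy_dentry *old = d->parent;

	new_parent->refcount++;		/* child now pins the new parent */
	d->parent = new_parent;
	old->refcount--;		/* ... and no longer pins the old one */
}

/* Walks through the scenario; returns 1 if the pinned pointer went stale. */
static int toy_demo(void)
{
	struct toy_dentry a = { 1, NULL };	/* pinned only by child */
	struct toy_dentry b = { 1, NULL };	/* pinned by somebody else */
	struct toy_dentry child = { 1, &a };
	struct toy_dentry *p = toy_get_parent(&child);	/* a.refcount == 2 */
	int stale;

	toy_move(&child, &b);		/* "rename" overlaps with our use of p */
	stale = (p != child.parent);	/* p is no longer the parent ... */
	assert(p->refcount == 1);	/* ... but our reference keeps it alive */
	assert(toy_put(p) == 1);	/* the last reference was ours */
	return stale;
}
```

In this toy model toy_demo() ends up holding a pointer that is safe to dereference and yet no longer matches child.parent, which is the situation a ->d_revalidate() instance has to be prepared for.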
Note that the original parent won't disappear (it's pinned by the caller),
but there's no promise that what you'll fetch from dentry->d_parent inside
the method will have anything to do with that.

BTW, we might be better off if we passed the parent and child as separate
arguments...  By the quick look through the instances, we have
	* a bunch that don't look at the parent at all
	* some that use dget_parent()/dput() (and often enough use only ->d_inode
of the parent).
	* some that look at it under dentry->d_lock - that's enough for stability,
but can't block.  ceph, BTW, does igrab() of parent's inode under ->d_lock,
uses it outside of ->d_lock and iput() in the end.
	* kernfs, which serializes just about everything on a single system-wide
mutex.
	* lustre (and ext4 crypto in -next) - broken

Only the third class (and actually only one instance in there - vfat) wouldn't
be just as fine if we passed it the parent as argument.  VFAT one does
	spin_lock(&dentry->d_lock);
	if (dentry->d_time != d_inode(dentry->d_parent)->i_version)
		ret = 0;
	spin_unlock(&dentry->d_lock);
That one does care about ->d_time and ->d_parent being from the same moment.
And it can bloody well keep doing what it does.

Reducing the amount of dget_parent() callers would also be nice...
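
For what the vfat-style check buys, a small userspace sketch may help. It is an illustration only: struct snap_dir, struct snap_child and the snap_*() helpers are invented names, a pthread mutex stands in for ->d_lock, and the writer side (snap_reparent() updating the parent and the stamp together under that same lock) is an assumption of the toy model, not a claim about what the kernel's rename path does.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the parent directory's inode. */
struct snap_dir {
	unsigned long version;		/* models ->i_version */
};

/* Hypothetical stand-in for the child dentry. */
struct snap_child {
	pthread_mutex_t lock;		/* models ->d_lock */
	unsigned long time;		/* models ->d_time */
	struct snap_dir *parent_dir;	/* models d_inode(->d_parent) */
};

/* Returns 1 if the child still looks valid, 0 if it needs a fresh lookup. */
static int snap_revalidate(struct snap_child *c)
{
	int ret = 1;

	/* Both fields are read under one lock, so they are from the same moment. */
	pthread_mutex_lock(&c->lock);
	if (c->time != c->parent_dir->version)
		ret = 0;
	pthread_mutex_unlock(&c->lock);
	return ret;
}

/* Toy writer (assumption): moves the child and restamps it atomically. */
static void snap_reparent(struct snap_child *c, struct snap_dir *dir)
{
	pthread_mutex_lock(&c->lock);
	c->parent_dir = dir;
	c->time = dir->version;
	pthread_mutex_unlock(&c->lock);
}

/* Small walk-through; returns 1 if every step behaved as described. */
static int snap_demo(void)
{
	struct snap_dir d1 = { 5 }, d2 = { 9 };
	struct snap_child c = { PTHREAD_MUTEX_INITIALIZER, 5, &d1 };
	int ok = snap_revalidate(&c) == 1;	/* stamp matches d1 */

	d1.version = 6;				/* directory changed */
	ok = ok && snap_revalidate(&c) == 0;	/* stale stamp is noticed */
	snap_reparent(&c, &d2);			/* moved and restamped */
	ok = ok && snap_revalidate(&c) == 1;	/* consistent with d2 */
	return ok;
}
```

Nothing done while the lock is held may sleep in the real thing, since ->d_lock is a spinlock; that is the "can't block" restriction on this class of instances.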