On Fri, 4 Mar 2011, Sage Weil wrote: > The rcu path walk changes for 2.6.38 shone light in some dark corners > where Ceph was using the dcache in racy ways. The main problem is this: > > * The Ceph MDS server gives us lease information such that we know the > contents for a directory won't change. > * We want to do lookup on non-existant items without interacting with the > server. (We also want to do readdir, but that's a more complicated > case.) > * The existing hooks (d_release is what we were using) do not give us > the opportunity to clear our "this directory is completely cached" flag > prior to the dentry being unhashed. > * d_lookup can't look at the "complete" flag and conclude a dentry > doesn't exist without worrying about a race with the pruner. > > There are two cases where this matters: > > * The VFS does a lookup prior to any create, which means we do two server > requests instead of one. Some VFS refactoring could probably fix that > (and Al has some ideas there). > * A user looks up a non-existent file. This should not require a server > request at all. > > The race we care about is with the pruner (shrink_dentry_list and > shrink_dcache_for_umount_subtree). Dropping dentries due to actual > changes (rename, unlink, rmdir) all go through the usual d_ops where we > have ample opportunity to the right thing (with the exception of one > dentry_unhash in vfs_rename_dir() :/). Unless there are other (non-pruner) cases where the VFS randomly unhashes things, I think vfs_rename_dir() is the main problem with the d_prune method approach. Specifically, vfs_rename_dir() does if (target) dentry_unhash(new_dentry); prior to calling ->rename(). It's not clear to me why vfs_rename_dir() does this but vfs_rename_other() does not. Could it be pushed down into the ->rename() method? Also, the if (d_unhashed(new_dentry)) d_rehash(new_dentry); after ->rename() has always looked fishy/buggy to me. See http://lkml.org/lkml/2008/7/30/165 Thanks- sage > Is something like the patch below sane? It notifies the fs prior to > unhashing the dentry, giving Ceph the chance to clear its "complete" bit. > Are there other reasons the VFS would drop dentries that I'm missing? > > Some alternatives we've considered: > > * Double-caching. We could keep a copy of directory contents in the fs > layer and use that to do the lookup. Yuck. > * Put the "complete" bit in the dcache. The problem is it's a flag on > the parent, d_flags is protected by d_lock, and we can't take the parent's > d_lock while holding the child's d_lock (which we do while pruning). > Extra work that most fs's don't need. > * Serializing lookups against the pruner in some other way. > > Any thoughts or suggestions? > > Thanks! > sage > > > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking > index 4471a41..180e14b 100644 > --- a/Documentation/filesystems/Locking > +++ b/Documentation/filesystems/Locking > @@ -28,6 +28,7 @@ d_revalidate: no no yes (ref-walk) maybe > d_hash no no no maybe > d_compare: yes no no maybe > d_delete: no yes no no > +d_prune: no yes no no > d_release: no no yes no > d_iput: no no yes no > d_dname: no no no no > diff --git a/fs/dcache.c b/fs/dcache.c > index 2a6bd9a..cdb5d81 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -690,6 +690,8 @@ static void try_prune_one_dentry(struct dentry *dentry) > spin_unlock(&dentry->d_lock); > return; > } > + if (dentry->d_op->d_flags & DCACHE_OP_PRUNE) > + dentry->d_op->d_prune(dentry); > dentry = dentry_kill(dentry, 1); > } > } > @@ -896,6 +898,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > > /* detach this root from the system */ > spin_lock(&dentry->d_lock); > + if (dentry->d_op->d_flags & DCACHE_OP_PRUNE) > + dentry->d_op->d_prune(dentry); > dentry_lru_del(dentry); > __d_drop(dentry); > spin_unlock(&dentry->d_lock); > @@ -912,6 +916,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > d_u.d_child) { > spin_lock_nested(&loop->d_lock, > DENTRY_D_LOCK_NESTED); > + if (dentry->d_op->d_flags & DCACHE_OP_PRUNE) > + dentry->d_op->d_prune(dentry); > dentry_lru_del(loop); > __d_drop(loop); > spin_unlock(&loop->d_lock); > @@ -1375,6 +1381,8 @@ void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op) > dentry->d_flags |= DCACHE_OP_REVALIDATE; > if (op->d_delete) > dentry->d_flags |= DCACHE_OP_DELETE; > + if (op->d_prune) > + dentry->d_flags |= DCACHE_OP_PRUNE; > > } > EXPORT_SYMBOL(d_set_d_op); > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index f958c19..1e83bd8 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -165,6 +165,7 @@ struct dentry_operations { > unsigned int, const char *, const struct qstr *); > int (*d_delete)(const struct dentry *); > void (*d_release)(struct dentry *); > + void (*d_prune)(struct dentry *); > void (*d_iput)(struct dentry *, struct inode *); > char *(*d_dname)(struct dentry *, char *, int); > struct vfsmount *(*d_automount)(struct path *); > @@ -219,6 +220,8 @@ struct dentry_operations { > #define DCACHE_MANAGED_DENTRY \ > (DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT) > > +#define DCACHE_OP_PRUNE 0x80000 > + > extern seqlock_t rename_lock; > > static inline int dname_external(struct dentry *dentry) > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html