The rcu path walk changes for 2.6.38 shone light in some dark corners where Ceph was using the dcache in racy ways. The main problem is this: * The Ceph MDS server gives us lease information such that we know the contents for a directory won't change. * We want to do lookup on non-existant items without interacting with the server. (We also want to do readdir, but that's a more complicated case.) * The existing hooks (d_release is what we were using) do not give us the opportunity to clear our "this directory is completely cached" flag prior to the dentry being unhashed. * d_lookup can't look at the "complete" flag and conclude a dentry doesn't exist without worrying about a race with the pruner. There are two cases where this matters: * The VFS does a lookup prior to any create, which means we do two server requests instead of one. Some VFS refactoring could probably fix that (and Al has some ideas there). * A user looks up a non-existent file. This should not require a server request at all. The race we care about is with the pruner (shrink_dentry_list and shrink_dcache_for_umount_subtree). Dropping dentries due to actual changes (rename, unlink, rmdir) all go through the usual d_ops where we have ample opportunity to the right thing (with the exception of one dentry_unhash in vfs_rename_dir() :/). Is something like the patch below sane? It notifies the fs prior to unhashing the dentry, giving Ceph the chance to clear its "complete" bit. Are there other reasons the VFS would drop dentries that I'm missing? Some alternatives we've considered: * Double-caching. We could keep a copy of directory contents in the fs layer and use that to do the lookup. Yuck. * Put the "complete" bit in the dcache. The problem is it's a flag on the parent, d_flags is protected by d_lock, and we can't take the parent's d_lock while holding the child's d_lock (which we do while pruning). Extra work that most fs's don't need. * Serializing lookups against the pruner in some other way. Any thoughts or suggestions? Thanks! sage diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 4471a41..180e14b 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -28,6 +28,7 @@ d_revalidate: no no yes (ref-walk) maybe d_hash no no no maybe d_compare: yes no no maybe d_delete: no yes no no +d_prune: no yes no no d_release: no no yes no d_iput: no no yes no d_dname: no no no no diff --git a/fs/dcache.c b/fs/dcache.c index 2a6bd9a..cdb5d81 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -690,6 +690,8 @@ static void try_prune_one_dentry(struct dentry *dentry) spin_unlock(&dentry->d_lock); return; } + if (dentry->d_op->d_flags & DCACHE_OP_PRUNE) + dentry->d_op->d_prune(dentry); dentry = dentry_kill(dentry, 1); } } @@ -896,6 +898,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) /* detach this root from the system */ spin_lock(&dentry->d_lock); + if (dentry->d_op->d_flags & DCACHE_OP_PRUNE) + dentry->d_op->d_prune(dentry); dentry_lru_del(dentry); __d_drop(dentry); spin_unlock(&dentry->d_lock); @@ -912,6 +916,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) d_u.d_child) { spin_lock_nested(&loop->d_lock, DENTRY_D_LOCK_NESTED); + if (dentry->d_op->d_flags & DCACHE_OP_PRUNE) + dentry->d_op->d_prune(dentry); dentry_lru_del(loop); __d_drop(loop); spin_unlock(&loop->d_lock); @@ -1375,6 +1381,8 @@ void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op) dentry->d_flags |= DCACHE_OP_REVALIDATE; if (op->d_delete) dentry->d_flags |= DCACHE_OP_DELETE; + if (op->d_prune) + dentry->d_flags |= DCACHE_OP_PRUNE; } EXPORT_SYMBOL(d_set_d_op); diff --git a/include/linux/dcache.h b/include/linux/dcache.h index f958c19..1e83bd8 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -165,6 +165,7 @@ struct dentry_operations { unsigned int, const char *, const struct qstr *); int (*d_delete)(const struct dentry *); void (*d_release)(struct dentry *); + void (*d_prune)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); char *(*d_dname)(struct dentry *, char *, int); struct vfsmount *(*d_automount)(struct path *); @@ -219,6 +220,8 @@ struct dentry_operations { #define DCACHE_MANAGED_DENTRY \ (DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT) +#define DCACHE_OP_PRUNE 0x80000 + extern seqlock_t rename_lock; static inline int dname_external(struct dentry *dentry) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html