[RFC] d_prune

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



The rcu path walk changes for 2.6.38 shone light in some dark corners 
where Ceph was using the dcache in racy ways.  The main problem is this:

 * The Ceph MDS server gives us lease information such that we know the
   contents for a directory won't change.
 * We want to do lookup on non-existant items without interacting with the
   server.  (We also want to do readdir, but that's a more complicated
   case.)
 * The existing hooks (d_release is what we were using) do not give us
   the opportunity to clear our "this directory is completely cached" flag
   prior to the dentry being unhashed.
 * d_lookup can't look at the "complete" flag and conclude a dentry 
   doesn't exist without worrying about a race with the pruner.

There are two cases where this matters:

 * The VFS does a lookup prior to any create, which means we do two server 
   requests instead of one.  Some VFS refactoring could probably fix that 
   (and Al has some ideas there).
 * A user looks up a non-existent file.  This should not require a server
   request at all.

The race we care about is with the pruner (shrink_dentry_list and 
shrink_dcache_for_umount_subtree).  Dropping dentries due to actual 
changes (rename, unlink, rmdir) all go through the usual d_ops where we 
have ample opportunity to the right thing (with the exception of one 
dentry_unhash in vfs_rename_dir() :/).

Is something like the patch below sane?  It notifies the fs prior to 
unhashing the dentry, giving Ceph the chance to clear its "complete" bit.  
Are there other reasons the VFS would drop dentries that I'm missing?

Some alternatives we've considered:

 * Double-caching.  We could keep a copy of directory contents in the fs
   layer and use that to do the lookup.  Yuck.
 * Put the "complete" bit in the dcache.  The problem is it's a flag on 
   the parent, d_flags is protected by d_lock, and we can't take the parent's
   d_lock while holding the child's d_lock (which we do while pruning).  
   Extra work that most fs's don't need.
 * Serializing lookups against the pruner in some other way.

Any thoughts or suggestions?

Thanks! 
sage


diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 4471a41..180e14b 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -28,6 +28,7 @@ d_revalidate: no              no              yes (ref-walk)  maybe
 d_hash         no              no              no              maybe
 d_compare:     yes             no              no              maybe
 d_delete:      no              yes             no              no
+d_prune:        no              yes             no              no
 d_release:     no              no              yes             no
 d_iput:                no              no              yes             no
 d_dname:       no              no              no              no
diff --git a/fs/dcache.c b/fs/dcache.c
index 2a6bd9a..cdb5d81 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -690,6 +690,8 @@ static void try_prune_one_dentry(struct dentry *dentry)
                        spin_unlock(&dentry->d_lock);
                        return;
                }
+               if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
+                       dentry->d_op->d_prune(dentry);
                dentry = dentry_kill(dentry, 1);
        }
 }
@@ -896,6 +898,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 
        /* detach this root from the system */
        spin_lock(&dentry->d_lock);
+       if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
+               dentry->d_op->d_prune(dentry);
        dentry_lru_del(dentry);
        __d_drop(dentry);
        spin_unlock(&dentry->d_lock);
@@ -912,6 +916,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
                                            d_u.d_child) {
                                spin_lock_nested(&loop->d_lock,
                                                DENTRY_D_LOCK_NESTED);
+                               if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
+                                       dentry->d_op->d_prune(dentry);
                                dentry_lru_del(loop);
                                __d_drop(loop);
                                spin_unlock(&loop->d_lock);
@@ -1375,6 +1381,8 @@ void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
                dentry->d_flags |= DCACHE_OP_REVALIDATE;
        if (op->d_delete)
                dentry->d_flags |= DCACHE_OP_DELETE;
+       if (op->d_prune)
+               dentry->d_flags |= DCACHE_OP_PRUNE;
 
 }
 EXPORT_SYMBOL(d_set_d_op);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f958c19..1e83bd8 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -165,6 +165,7 @@ struct dentry_operations {
                        unsigned int, const char *, const struct qstr *);
        int (*d_delete)(const struct dentry *);
        void (*d_release)(struct dentry *);
+       void (*d_prune)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
        char *(*d_dname)(struct dentry *, char *, int);
        struct vfsmount *(*d_automount)(struct path *);
@@ -219,6 +220,8 @@ struct dentry_operations {
 #define DCACHE_MANAGED_DENTRY \
        (DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
 
+#define DCACHE_OP_PRUNE         0x80000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(struct dentry *dentry)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux