Omar Sandoval <osandov@xxxxxxxxxxx> writes: > On Mon, Mar 30, 2015 at 02:30:34PM +0200, David Sterba wrote: >> On Mon, Mar 30, 2015 at 02:02:17AM -0700, Omar Sandoval wrote: >> > Before commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"), >> > d_invalidate() could return -EBUSY when a dentry for a directory had >> > more than one reference to it. This is what prevented a mounted >> > subvolume from being deleted, as struct vfsmount holds a reference to >> > the subvolume dentry. However, that commit removed that case, and later >> > commits in that patch series removed the return code from d_invalidate() >> > completely, so we don't get that check for free anymore. So, reintroduce >> > it in btrfs_ioctl_snap_destroy(). >> >> > This applies to 4.0-rc6. To be honest, I'm not sure that this is the most >> > correct fix for this bug, but it's equivalent to the pre-3.18 behavior and it's >> > the best that I could come up with. Thoughts? >> >> > + spin_lock(&dentry->d_lock); >> > + err = dentry->d_lockref.count > 1 ? -EBUSY : 0; >> > + spin_unlock(&dentry->d_lock); >> >> The fix restores the original behaviour, but I don't think opencoding and >> using internals is fine. Either there should be a vfs api for that or >> there's an existing one that can be used instead. I have a problem with restoring the original behavior as is. In some sense it re-introduces the security issue that the d_invalidate changes were built to fix. Any user in the system can create a user namespace, create a mount namespace and keep any subvolume pinned forever. Which at the very least could make a very nice DOS attack. I am not familiar enough with how people use subvolumes and So let me ask. How can userspace not know that a subvolume that they want to delete is already mounted? I can see having something like is_local_mount_root and denying the subvolume destruction if the mount that is pinning it is in your local mount namespace. >> The bug here seems defined up to the point that we're trying to delete a >> subvolume that's a mountpoint. My next guess is that a check >> >> if (d_mountpoint(&dentry)) { ... } >> >> could work. > > That was my first instinct as well, but d_mountpoint() is true for > dentries that have a filesystem mounted on them (e.g., after mount > /dev/sda1 /mnt, the dentry for "/mnt"), not the dentry that is mounted. > > I poked around the mount code for awhile and couldn't come up with > anything using the existing interface. Mounting subvolumes bubbles down > to mount_subtree(), which doesn't really leave any traces of which > subvolume is mounted except for the dentry in struct vfsmount. > > (As far as I can tell, under the covers subvolume deletion is more or > less equivalent to an rm -rf, and we obviously don't do anything to stop > users from doing that on the root of their mounted filesystem, but it > appears that users expect the original behavior.) > > Here's an idea: mark mount root dentries as such in the VFS and check it > in the Btrfs code. Adding fsdevel ML for comments > (https://lkml.org/lkml/2015/3/30/125 is the original message). Marking root dentries is needed to fix the bug that you can escape the limitations of loopback mounts with a carefully placed rename. I have a patch cooking that marks mountpoints and tracks all of the mounts on a dentry. So except for the possibility of stepping on each others toes I have no objections. Eric > ---- > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 74609b9..8a0933d 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -2384,6 +2384,11 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, > goto out_dput; > } > > + if (d_is_mount_root(dentry)) { > + err = -EBUSY; > + goto out_dput; > + } > + > mutex_lock(&inode->i_mutex); > > /* > diff --git a/fs/namespace.c b/fs/namespace.c > index 82ef140..a28ca15 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -920,6 +920,10 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void > return ERR_CAST(root); > } > > + spin_lock(&root->d_lock); > + root->d_flags |= DCACHE_MOUNT_ROOT; > + spin_unlock(&root->d_lock); > + > mnt->mnt.mnt_root = root; > mnt->mnt.mnt_sb = root->d_sb; > mnt->mnt_mountpoint = mnt->mnt.mnt_root; > @@ -1017,6 +1021,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, > > static void cleanup_mnt(struct mount *mnt) > { > + struct dentry *root = mnt->mnt.mnt_root; > + > /* > * This probably indicates that somebody messed > * up a mnt_want/drop_write() pair. If this > @@ -1031,7 +1037,10 @@ static void cleanup_mnt(struct mount *mnt) > if (unlikely(mnt->mnt_pins.first)) > mnt_pin_kill(mnt); > fsnotify_vfsmount_delete(&mnt->mnt); > - dput(mnt->mnt.mnt_root); > + spin_lock(&root->d_lock); > + root->d_flags &= ~DCACHE_MOUNT_ROOT; > + spin_unlock(&root->d_lock); > + dput(root); > deactivate_super(mnt->mnt.mnt_sb); > mnt_free_id(mnt); > call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt); > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index d835879..d974ab8 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -225,6 +225,7 @@ struct dentry_operations { > > #define DCACHE_MAY_FREE 0x00800000 > #define DCACHE_FALLTHRU 0x01000000 /* Fall through to lower layer */ > +#define DCACHE_MOUNT_ROOT 0x02000000 /* is the root of a mount */ > > extern seqlock_t rename_lock; > > @@ -401,6 +402,16 @@ static inline bool d_mountpoint(const struct dentry *dentry) > return dentry->d_flags & DCACHE_MOUNTED; > } > > +static inline bool d_is_mount_root(const struct dentry *dentry) > +{ > + bool ret; > + > + spin_lock(&dentry->d_lock); > + ret = dentry->d_flags & DCACHE_MOUNT_ROOT; > + spin_unlock(&dentry->d_lock); > + return ret; > +} > + > /* > * Directory cache entry type accessor functions. > */ > ---- -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html