Hi On Wed, Apr 8, 2020 at 5:27 PM Christian Brauner <christian.brauner@xxxxxxxxxx> wrote: > > This implements loopfs, a loop device filesystem. It takes inspiration > from the binderfs filesystem I implemented about two years ago and with > which we had overally good experiences so far. Parts of it are also > based on [3] but it's mostly a new, imho cleaner approach. > > One of the use-cases for loopfs is to allow to dynamically allocate loop > devices in sandboxed workloads without exposing /dev or > /dev/loop-control to the workload in question and without having to > implement a complex and also racy protocol to send around file > descriptors for loop devices. With loopfs each mount is a new instance, > i.e. loop devices created in one loopfs instance are independent of any > loop devices created in another loopfs instance. This allows > sufficiently privileged tools to have their own private stash of loop > device instances. > > In addition, the loopfs filesystem can be mounted by user namespace root > and is thus suitable for use in containers. Combined with syscall > interception this makes it possible to securely delegate mounting of > images on loop devices, i.e. when a users calls mount -o loop <image> > <mountpoint> it will be possible to completely setup the loop device > (enabled in later patches) and the mount syscall to actually perform the > mount will be handled through syscall interception and be performed by a > sufficiently privileged process. Syscall interception is already > supported through a new seccomp feature we implemented in [1] and > extended in [2] and is actively used in production workloads. The > additional loopfs work will be used there and in various other workloads > too. > > The number of loop devices available to a loopfs instance can be limited > by setting the "max" mount option to a positive integer. This e.g. > allows sufficiently privileged processes to dynamically enforce a limit > on the number of devices. This limit is dynamic in contrast to the > max_loop module option in that a sufficiently privileged process can > update it with a simple remount operation. > > The loopfs filesystem is placed under a new config option and special > care has been taken to not introduce any new code when users do not > select this config option. > > Note that in __loop_clr_fd() we now need not just check whether bdev is > valid but also whether bdev->bd_disk is valid. This wasn't necessary > before because in order to call LOOP_CLR_FD the loop device would need > to be open and thus bdev->bd_disk was guaranteed to be allocated. For > loopfs loop devices we allow callers to simply unlink them just as we do > for binderfs binder devices and we do also need to account for the case > where a loopfs superblock is shutdown while backing files might still be > associated with some loop devices. In such cases no bd_disk device will > be attached to bdev. This is not in itself noteworthy it's more about > documenting the "why" of the added bdev->bd_disk check for posterity. > > [1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") > [2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE") > [3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@xxxxxxxxxxxxx > Cc: Jens Axboe <axboe@xxxxxxxxx> > Cc: Seth Forshee <seth.forshee@xxxxxxxxxxxxx> > Cc: Tom Gundersen <teg@xxxxxxx> > Cc: Tejun Heo <tj@xxxxxxxxxx> > Cc: Christian Kellner <ckellner@xxxxxxxxxx> > Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> > Cc: "David S. Miller" <davem@xxxxxxxxxxxxx> > Cc: David Rheinsberg <david.rheinsberg@xxxxxxxxx> > Cc: Dmitry Vyukov <dvyukov@xxxxxxxxxx> > Cc: "Rafael J. Wysocki" <rafael@xxxxxxxxxx> > Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx> > --- > MAINTAINERS | 5 + > drivers/block/Kconfig | 4 + > drivers/block/Makefile | 1 + > drivers/block/loop.c | 151 +++++++++--- > drivers/block/loop.h | 8 +- > drivers/block/loopfs/Makefile | 3 + > drivers/block/loopfs/loopfs.c | 429 ++++++++++++++++++++++++++++++++++ > drivers/block/loopfs/loopfs.h | 35 +++ > include/uapi/linux/magic.h | 1 + > 9 files changed, 600 insertions(+), 37 deletions(-) > create mode 100644 drivers/block/loopfs/Makefile > create mode 100644 drivers/block/loopfs/loopfs.c > create mode 100644 drivers/block/loopfs/loopfs.h > [...] > diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c > new file mode 100644 > index 000000000000..ac46aa337008 > --- /dev/null > +++ b/drivers/block/loopfs/loopfs.c > @@ -0,0 +1,429 @@ [...] > +/** > + * loopfs_loop_device_create - allocate inode from super block of a loopfs mount > + * @lo: loop device for which we are creating a new device entry > + * @ref_inode: inode from wich the super block will be taken > + * @device_nr: device number of the associated disk device > + * > + * This function creates a new device node for @lo. > + * Minor numbers are limited and tracked globally. The > + * function will stash a struct loop_device for the specific loop > + * device in i_private of the inode. > + * It will go on to allocate a new inode from the super block of the > + * filesystem mount, stash a struct loop_device in its i_private field > + * and attach a dentry to that inode. > + * > + * Return: 0 on success, negative errno on failure > + */ > +int loopfs_loop_device_create(struct loop_device *lo, struct inode *ref_inode, > + dev_t device_nr) > +{ > + char name[DISK_NAME_LEN]; > + struct super_block *sb; > + struct loopfs_info *info; > + struct dentry *root, *dentry; > + struct inode *inode; > + > + sb = loopfs_i_sb(ref_inode); > + if (!sb) > + return 0; > + > + if (MAJOR(device_nr) != LOOP_MAJOR) > + return -EINVAL; > + > + info = LOOPFS_SB(sb); > + if ((info->device_count + 1) > info->mount_opts.max) > + return -ENOSPC; Can you elaborate what the use-case for this limit is? With loopfs in place, any process can create its own user_ns, mount their private loopfs and create as many loop-devices as they want. Hence, this limit does not serve as an effective global resource-control. Secondly, anyone with access to `loop-control` can now create loop instances until this limit is hit, thus causing anyone else to be unable to create more. This effectively prevents you from sharing a loopfs between non-trusting parties. I am unsure where that limit would actually be used? Thanks David > + > + if (snprintf(name, sizeof(name), "loop%d", lo->lo_number) >= sizeof(name)) > + return -EINVAL; > + > + inode = new_inode(sb); > + if (!inode) > + return -ENOMEM; > + > + /* > + * The i_fop field will be set to the correct fops by the device layer > + * when the loop device in this loopfs instance is opened. > + */ > + inode->i_ino = MINOR(device_nr) + INODE_OFFSET; > + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode); > + inode->i_uid = info->root_uid; > + inode->i_gid = info->root_gid; > + init_special_inode(inode, S_IFBLK | 0600, device_nr); > + > + root = sb->s_root; > + inode_lock(d_inode(root)); > + /* look it up */ > + dentry = lookup_one_len(name, root, strlen(name)); > + if (IS_ERR(dentry)) { > + inode_unlock(d_inode(root)); > + iput(inode); > + return PTR_ERR(dentry); > + } > + > + if (d_really_is_positive(dentry)) { > + /* already exists */ > + dput(dentry); > + inode_unlock(d_inode(root)); > + iput(inode); > + return -EEXIST; > + } > + > + d_instantiate(dentry, inode); > + fsnotify_create(d_inode(root), dentry); > + inode_unlock(d_inode(root)); > + > + inode->i_private = lo; > + lo->lo_loopfs_i = inode; > + info->device_count++; > + > + return 0; > +} [...]