On Wed, Apr 22, 2020 at 04:54:32PM +0200, Christian Brauner wrote:
> This implements loopfs, a loop device filesystem. It takes inspiration
> from the binderfs filesystem I implemented about two years ago and with
> which we had overall good experiences so far. Parts of it are also
> based on [3] but it's mostly a new, imho cleaner approach.
>
> Loopfs allows creating private loop device instances for applications
> for various use-cases. It covers the use-case that was expressed on-list
> and in-person to get programmatic access to private loop devices for
> image building in sandboxes. An illustration for this is provided in
> [4].
>
> Also loopfs is intended to provide loop devices to privileged and
> unprivileged containers which has been a frequent request from various
> major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
> providing a non-exhaustive list of issues and requests (cf. [5]) around
> this feature mainly to illustrate that I'm not making the use-cases up.
> Currently none of this can be done safely since handing a loop device
> from the host into a container means that the container can see anything
> that the host is doing with that loop device and what other containers
> are doing with that device too. And (bind-)mounting devtmpfs inside of
> containers is not secure at all so also not an option (though sometimes
> done out of despair apparently).
>
> The workloads people run in containers are supposed to be indiscernible
> from workloads run on the host and the tools inside of the container are
> supposed to not be required to be aware that they are running inside a
> container apart from containerization tools themselves. This is
> especially true when running older distros in containers that did exist
> before containers were as ubiquitous as they are today. With loopfs, users
> can call mount -o loop and in a correctly set up container things work
> the same way they would on the host.
> The filesystem representation allows us to do this in a very simple way.
> At container setup, a container manager can mount a private instance of
> loopfs somewhere, e.g. at /dev/loopfs and then bind-mount or symlink
> /dev/loopfs/loop-control to /dev/loop-control, pre-allocate and symlink
> the number of standard devices into their standard location and have a
> service file or rules in place that symlink additionally allocated loop
> devices through losetup into place as well.
> With the new syscall interception logic this is also possible for
> unprivileged containers. In these cases when a user calls mount -o loop
> <image> <mountpoint> it will be possible to completely set up the loop
> device in the container. The final mount syscall is handled through
> syscall interception which we already implemented and released in
> earlier kernels (see [1] and [2]) and is actively used in production
> workloads. The mount is often rewritten to a fuse binary to provide safe
> access for unprivileged containers.
>
> Loopfs also allows the creation of hidden/detached dynamic loop devices
> and associated mounts which also was an often issued request. With the
> old mount api this can be achieved by creating a temporary loopfs and
> stashing a file descriptor to the mount point and the loop-control
> device and immediately unmounting the loopfs instance. With the new
> mount api a detached mount can be created directly (i.e. a mount not
> visible anywhere in the filesystem). New loop devices can then be
> allocated and configured. They can be mounted through
> /proc/self/fd/<nr> with the old mount api or by using the fd directly
> with the new mount api. Combined with a mount namespace this allows for
> fully auto-cleaned up loop devices on program crash. This ties back to
> various use-cases and is illustrated in [4].
>
> The filesystem representation requires the standard boilerplate
> filesystem code we know from other tiny filesystems.
> And all of the loopfs code is hidden under a config option that defaults
> to false. This specifically means that none of the code even exists when
> users do not have any use-case for loopfs.
> In addition, the loopfs code does not alter how loop devices behave at
> all, i.e. there are no changes to any existing workloads and I've taken
> care to ifdef all loopfs specific things out.
>
> Each loopfs mount is a separate instance. As such loop devices created
> in one instance are independent of loop devices created in another
> instance. This specifically entails that loop devices are only visible
> in the loopfs instance they belong to.
>
> The number of loop devices available in loopfs instances is
> hierarchically limited through /proc/sys/user/max_loop_devices via the
> ucount infrastructure (Thanks to David Rheinsberg for pointing out that
> missing piece.). An administrator could e.g. set
> echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
> instance mounted by uid x can only create 3 loop devices no matter how
> many loopfs instances they mount. This limit applies hierarchically to
> all user namespaces.

Hm, info->device_count is per loopfs mount, though, right? I don't see
where this gets incremented for all of a user's loopfs mounts when one
adds a loopdev?

I'm sure I'm missing something obvious...

> In addition, loopfs has a "max" mount option which allows setting a limit
> on the number of loop devices for a given loopfs instance. This is
> mainly to cover use-cases where a single loopfs mount is shared as a
> bind-mount between multiple parties that are prevented from creating
> other loopfs mounts and is equivalent to the semantics of the binderfs
> and devpts "max" mount option.
>
> Note that in __loop_clr_fd() we now need not just check whether bdev is
> valid but also whether bdev->bd_disk is valid.
> This wasn't necessary before because in order to call LOOP_CLR_FD the
> loop device would need to be open and thus bdev->bd_disk was guaranteed
> to be allocated. For loopfs loop devices we allow callers to simply
> unlink them just as we do for binderfs binder devices and we do also
> need to account for the case where a loopfs superblock is shut down while
> backing files might still be associated with some loop devices. In such
> cases no bd_disk device will be attached to bdev. This is not in itself
> noteworthy; it's more about documenting the "why" of the added
> bdev->bd_disk check for posterity.
>
> [1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> [2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
> [3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@xxxxxxxxxxxxx
> [4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
> [5]: https://github.com/kubernetes-sigs/kind/issues/1333
>      https://github.com/kubernetes-sigs/kind/issues/1248
>      https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
>      https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
>      https://gitlab.com/gitlab-com/support-forum/issues/3732
>      https://github.com/moby/moby/issues/27886
>      https://twitter.com/_AkihiroSuda_/status/1249664478267854848
>      https://serverfault.com/questions/701384/loop-device-in-a-linux-container
>      https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
>      https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
> Cc: Jens Axboe <axboe@xxxxxxxxx>
> Cc: Steve Barber <smbarber@xxxxxxxxxx>
> Cc: Filipe Brandenburger <filbranden@xxxxxxxxx>
> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> Cc: Benjamin Elder <bentheelder@xxxxxxxxxx>
> Cc: Seth Forshee <seth.forshee@xxxxxxxxxxxxx>
> Cc: Stéphane Graber <stgraber@xxxxxxxxxx>
> Cc: Tom Gundersen <teg@xxxxxxx>
> Cc: Serge Hallyn <serge@xxxxxxxxxx>

Reviewed-by: Serge Hallyn <serge@xxxxxxxxxx>

> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Christian Kellner <ckellner@xxxxxxxxxx>
> Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
> Cc: Dylan Reid <dgreid@xxxxxxxxxx>
> Cc: David Rheinsberg <david.rheinsberg@xxxxxxxxx>
> Cc: Akihiro Suda <suda.kyoto@xxxxxxxxx>
> Cc: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
> Cc: "Rafael J. Wysocki" <rafael@xxxxxxxxxx>
> Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx>
> ---
> /* v2 */
> - David Rheinsberg <david.rheinsberg@xxxxxxxxx> /
>   Christian Brauner <christian.brauner@xxxxxxxxxx>:
>   - Correctly clean up loop devices that are in-use after the loopfs
>     instance has been shut down. This is important for some use-cases
>     that David pointed out where they effectively create a loopfs
>     instance, allocate devices and drop unnecessary references to it.
> - Christian Brauner <christian.brauner@xxxxxxxxxx>:
>   - Replace lo_loopfs_i inode member in struct loop_device with a custom
>     struct lo_info pointer which is only allocated for loopfs loop
>     devices.
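
On the accounting question above, here is how I read the two limits, as a
rough userspace model rather than kernel code. It assumes a single, flat
user namespace (the real ucount code walks the namespace hierarchy), and
struct user_count, struct loopfs_mount, model_loopfs_add() and
model_loopfs_remove() are names made up for this sketch. The order of the
checks follows loopfs_add() in the patch: the per-mount "max" first, then
the per-user count that inc_ucount() would charge.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the per-user ucount (flat, one namespace). */
struct user_count {
	unsigned long devices;	/* loop devices across all mounts of this uid */
	unsigned long max;	/* models /proc/sys/user/max_loop_devices */
};

/* Hypothetical stand-in for struct loopfs_info of one loopfs mount. */
struct loopfs_mount {
	struct user_count *owner;
	unsigned long device_count;	/* devices in this mount only */
	unsigned long max;		/* models the "max" mount option */
};

/* Same order of checks as loopfs_add(): per-mount cap, then per-user cap. */
static int model_loopfs_add(struct loopfs_mount *m)
{
	if (m->device_count + 1 > m->max)
		return -ENOSPC;

	/* This is the part inc_ucount() would do for the mounting uid. */
	if (m->owner->devices + 1 > m->owner->max)
		return -ENOSPC;

	m->owner->devices++;
	m->device_count++;

	/* A mount can never hold more devices than its owner is charged for. */
	assert(m->device_count <= m->owner->devices);
	return 0;
}

/* Undo one successful model_loopfs_add() on the same mount. */
static void model_loopfs_remove(struct loopfs_mount *m)
{
	assert(m->device_count > 0);
	m->device_count--;
	m->owner->devices--;
}
```

With the per-user limit modelled as 3 and two mounts owned by the same uid,
the fourth model_loopfs_add() fails with -ENOSPC regardless of which mount
it targets, which matches the behaviour the commit message describes. So if
this reading is right, info->device_count only backs the "max" option and
the cross-mount limit comes from the ucount.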
> ---
>  MAINTAINERS                    |   5 +
>  drivers/block/Kconfig          |   4 +
>  drivers/block/Makefile         |   1 +
>  drivers/block/loop.c           | 200 ++++++++++---
>  drivers/block/loop.h           |  12 +-
>  drivers/block/loopfs/Makefile  |   3 +
>  drivers/block/loopfs/loopfs.c  | 494 +++++++++++++++++++++++++++++++++
>  drivers/block/loopfs/loopfs.h  |  36 +++
>  include/linux/user_namespace.h |   3 +
>  include/uapi/linux/magic.h     |   1 +
>  kernel/ucount.c                |   3 +
>  11 files changed, 721 insertions(+), 41 deletions(-)
>  create mode 100644 drivers/block/loopfs/Makefile
>  create mode 100644 drivers/block/loopfs/loopfs.c
>  create mode 100644 drivers/block/loopfs/loopfs.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b816a453b10e..560b37a65bce 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9957,6 +9957,11 @@ W:	http://www.avagotech.com/support/
>  F:	drivers/message/fusion/
>  F:	drivers/scsi/mpt3sas/
>
> +LOOPFS FILE SYSTEM
> +M:	Christian Brauner <christian.brauner@xxxxxxxxxx>
> +S:	Supported
> +F:	drivers/block/loopfs/
> +
>  LSILOGIC/SYMBIOS/NCR 53C8XX and 53C1010 PCI-SCSI drivers
>  M:	Matthew Wilcox <willy@xxxxxxxxxxxxx>
>  L:	linux-scsi@xxxxxxxxxxxxxxx
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 025b1b77b11a..d7ff37d795ad 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -214,6 +214,10 @@ config BLK_DEV_LOOP
>
>  	  Most users will answer N here.
> > +config BLK_DEV_LOOPFS > + bool "Loopback device virtual filesystem support" > + depends on BLK_DEV_LOOP=y > + > config BLK_DEV_LOOP_MIN_COUNT > int "Number of loop devices to pre-create at init time" > depends on BLK_DEV_LOOP > diff --git a/drivers/block/Makefile b/drivers/block/Makefile > index 795facd8cf19..7052be26aa8b 100644 > --- a/drivers/block/Makefile > +++ b/drivers/block/Makefile > @@ -36,6 +36,7 @@ obj-$(CONFIG_XEN_BLKDEV_BACKEND) += xen-blkback/ > obj-$(CONFIG_BLK_DEV_DRBD) += drbd/ > obj-$(CONFIG_BLK_DEV_RBD) += rbd.o > obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/ > +obj-$(CONFIG_BLK_DEV_LOOPFS) += loopfs/ > > obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/ > obj-$(CONFIG_ZRAM) += zram/ > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index da693e6a834e..52f7583dd17d 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -81,6 +81,10 @@ > > #include "loop.h" > > +#ifdef CONFIG_BLK_DEV_LOOPFS > +#include "loopfs/loopfs.h" > +#endif > + > #include <linux/uaccess.h> > > static DEFINE_IDR(loop_index_idr); > @@ -1115,6 +1119,24 @@ loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer, > return err; > } > > +static void loop_remove(struct loop_device *lo) > +{ > +#ifdef CONFIG_BLK_DEV_LOOPFS > + loopfs_remove(lo); > +#endif > + del_gendisk(lo->lo_disk); > + blk_cleanup_queue(lo->lo_queue); > + blk_mq_free_tag_set(&lo->tag_set); > + put_disk(lo->lo_disk); > + kfree(lo); > +} > + > +static inline void __loop_remove(struct loop_device *lo) > +{ > + idr_remove(&loop_index_idr, lo->lo_number); > + loop_remove(lo); > +} > + > static int __loop_clr_fd(struct loop_device *lo, bool release) > { > struct file *filp = NULL; > @@ -1164,7 +1186,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) > } > set_capacity(lo->lo_disk, 0); > loop_sysfs_exit(lo); > - if (bdev) { > + if (bdev && bdev->bd_disk) { > bd_set_size(bdev, 0); > /* let user-space know about this change */ > 
kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE); > @@ -1174,7 +1196,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) > module_put(THIS_MODULE); > blk_mq_unfreeze_queue(lo->lo_queue); > > - partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev; > + partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev && bdev->bd_disk; > lo_number = lo->lo_number; > loop_unprepare_queue(lo); > out_unlock: > @@ -1213,7 +1235,12 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) > lo->lo_flags = 0; > if (!part_shift) > lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN; > - lo->lo_state = Lo_unbound; > +#ifdef CONFIG_BLK_DEV_LOOPFS > + if (loopfs_wants_remove(lo)) > + __loop_remove(lo); > + else > +#endif > + lo->lo_state = Lo_unbound; > mutex_unlock(&loop_ctl_mutex); > > /* > @@ -1259,6 +1286,74 @@ static int loop_clr_fd(struct loop_device *lo) > return __loop_clr_fd(lo, false); > } > > +#ifdef CONFIG_BLK_DEV_LOOPFS > +int loopfs_rundown_locked(struct loop_device *lo) > +{ > + int ret; > + > + if (WARN_ON_ONCE(!loopfs_device(lo))) > + return -EINVAL; > + > + ret = mutex_lock_killable(&loop_ctl_mutex); > + if (ret) > + return ret; > + > + if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0) { > + ret = -EBUSY; > + } else { > + /* > + * Since the device is unbound it has no associated backing > + * file and we can safely set Lo_rundown to prevent it from > + * being found. Actual cleanup happens during inode eviction. > + */ > + lo->lo_state = Lo_rundown; > + ret = 0; > + } > + > + mutex_unlock(&loop_ctl_mutex); > + return ret; > +} > + > +/** > + * loopfs_evict_locked() - remove loop device or mark inactive > + * @lo: loopfs loop device > + * > + * This function will remove a loop device. If it has no users > + * and is bound the backing file will be cleaned up. If the loop > + * device has users it will be marked for auto cleanup. 
> + * This function is only called when a loopfs instance is shutdown > + * when all references to it from this loopfs instance have been > + * dropped. If there are still any references to it cleanup will > + * happen in lo_release(). > + */ > +void loopfs_evict_locked(struct loop_device *lo) > +{ > + struct lo_loopfs *lo_info; > + struct inode *lo_inode; > + > + WARN_ON_ONCE(!loopfs_device(lo)); > + > + mutex_lock(&loop_ctl_mutex); > + lo_info = lo->lo_info; > + lo_inode = lo_info->lo_inode; > + lo_info->lo_inode = NULL; > + lo_info->lo_flags |= LOOPFS_FLAGS_INACTIVE; > + > + if (atomic_read(&lo->lo_refcnt) > 0) { > + lo->lo_flags |= LO_FLAGS_AUTOCLEAR; > + } else { > + lo->lo_state = Lo_rundown; > + lo->lo_disk->private_data = NULL; > + lo_inode->i_private = NULL; > + > + mutex_unlock(&loop_ctl_mutex); > + __loop_clr_fd(lo, false); > + return; > + } > + mutex_unlock(&loop_ctl_mutex); > +} > +#endif /* CONFIG_BLK_DEV_LOOPFS */ > + > static int > loop_set_status(struct loop_device *lo, const struct loop_info64 *info) > { > @@ -1842,7 +1937,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode) > > if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) { > if (lo->lo_state != Lo_bound) > - goto out_unlock; > + goto out_remove; > lo->lo_state = Lo_rundown; > mutex_unlock(&loop_ctl_mutex); > /* > @@ -1860,6 +1955,12 @@ static void lo_release(struct gendisk *disk, fmode_t mode) > blk_mq_unfreeze_queue(lo->lo_queue); > } > > +out_remove: > +#ifdef CONFIG_BLK_DEV_LOOPFS > + if (lo->lo_state != Lo_bound && loopfs_wants_remove(lo)) > + __loop_remove(lo); > +#endif > + > out_unlock: > mutex_unlock(&loop_ctl_mutex); > } > @@ -1878,6 +1979,11 @@ static const struct block_device_operations lo_fops = { > * And now the modules code and kernel interface. 
> */ > static int max_loop; > +#ifdef CONFIG_BLK_DEV_LOOPFS > +unsigned long max_devices; > +#else > +static unsigned long max_devices; > +#endif > module_param(max_loop, int, 0444); > MODULE_PARM_DESC(max_loop, "Maximum number of loop devices"); > module_param(max_part, int, 0444); > @@ -2006,7 +2112,7 @@ static const struct blk_mq_ops loop_mq_ops = { > .complete = lo_complete_rq, > }; > > -static int loop_add(struct loop_device **l, int i) > +static int loop_add(struct loop_device **l, int i, struct inode *inode) > { > struct loop_device *lo; > struct gendisk *disk; > @@ -2096,7 +2202,17 @@ static int loop_add(struct loop_device **l, int i) > disk->private_data = lo; > disk->queue = lo->lo_queue; > sprintf(disk->disk_name, "loop%d", i); > + > add_disk(disk); > + > +#ifdef CONFIG_BLK_DEV_LOOPFS > + err = loopfs_add(lo, inode, disk_devt(disk)); > + if (err) { > + __loop_remove(lo); > + goto out; > + } > +#endif > + > *l = lo; > return lo->lo_number; > > @@ -2112,36 +2228,41 @@ static int loop_add(struct loop_device **l, int i) > return err; > } > > -static void loop_remove(struct loop_device *lo) > -{ > - del_gendisk(lo->lo_disk); > - blk_cleanup_queue(lo->lo_queue); > - blk_mq_free_tag_set(&lo->tag_set); > - put_disk(lo->lo_disk); > - kfree(lo); > -} > +struct find_free_cb_data { > + struct loop_device **l; > + struct inode *inode; > +}; > > static int find_free_cb(int id, void *ptr, void *data) > { > struct loop_device *lo = ptr; > - struct loop_device **l = data; > + struct find_free_cb_data *cb_data = data; > > - if (lo->lo_state == Lo_unbound) { > - *l = lo; > - return 1; > - } > - return 0; > + if (lo->lo_state != Lo_unbound) > + return 0; > + > +#ifdef CONFIG_BLK_DEV_LOOPFS > + if (!loopfs_access(cb_data->inode, lo)) > + return 0; > +#endif > + > + *cb_data->l = lo; > + return 1; > } > > -static int loop_lookup(struct loop_device **l, int i) > +static int loop_lookup(struct loop_device **l, int i, struct inode *inode) > { > struct loop_device *lo; > int ret 
= -ENODEV; > > if (i < 0) { > int err; > + struct find_free_cb_data cb_data = { > + .l = &lo, > + .inode = inode, > + }; > > - err = idr_for_each(&loop_index_idr, &find_free_cb, &lo); > + err = idr_for_each(&loop_index_idr, &find_free_cb, &cb_data); > if (err == 1) { > *l = lo; > ret = lo->lo_number; > @@ -2152,6 +2273,11 @@ static int loop_lookup(struct loop_device **l, int i) > /* lookup and return a specific i */ > lo = idr_find(&loop_index_idr, i); > if (lo) { > +#ifdef CONFIG_BLK_DEV_LOOPFS > + if (!loopfs_access(inode, lo)) > + return -EACCES; > +#endif > + > *l = lo; > ret = lo->lo_number; > } > @@ -2166,9 +2292,9 @@ static struct kobject *loop_probe(dev_t dev, int *part, void *data) > int err; > > mutex_lock(&loop_ctl_mutex); > - err = loop_lookup(&lo, MINOR(dev) >> part_shift); > + err = loop_lookup(&lo, MINOR(dev) >> part_shift, NULL); > if (err < 0) > - err = loop_add(&lo, MINOR(dev) >> part_shift); > + err = loop_add(&lo, MINOR(dev) >> part_shift, NULL); > if (err < 0) > kobj = NULL; > else > @@ -2192,15 +2318,15 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd, > ret = -ENOSYS; > switch (cmd) { > case LOOP_CTL_ADD: > - ret = loop_lookup(&lo, parm); > + ret = loop_lookup(&lo, parm, file_inode(file)); > if (ret >= 0) { > ret = -EEXIST; > break; > } > - ret = loop_add(&lo, parm); > + ret = loop_add(&lo, parm, file_inode(file)); > break; > case LOOP_CTL_REMOVE: > - ret = loop_lookup(&lo, parm); > + ret = loop_lookup(&lo, parm, file_inode(file)); > if (ret < 0) > break; > if (lo->lo_state != Lo_unbound) { > @@ -2212,14 +2338,13 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd, > break; > } > lo->lo_disk->private_data = NULL; > - idr_remove(&loop_index_idr, lo->lo_number); > - loop_remove(lo); > + __loop_remove(lo); > break; > case LOOP_CTL_GET_FREE: > - ret = loop_lookup(&lo, -1); > + ret = loop_lookup(&lo, -1, file_inode(file)); > if (ret >= 0) > break; > - ret = loop_add(&lo, -1); > + ret = loop_add(&lo, -1, 
file_inode(file)); > } > mutex_unlock(&loop_ctl_mutex); > > @@ -2246,7 +2371,6 @@ MODULE_ALIAS("devname:loop-control"); > static int __init loop_init(void) > { > int i, nr; > - unsigned long range; > struct loop_device *lo; > int err; > > @@ -2285,10 +2409,10 @@ static int __init loop_init(void) > */ > if (max_loop) { > nr = max_loop; > - range = max_loop << part_shift; > + max_devices = max_loop << part_shift; > } else { > nr = CONFIG_BLK_DEV_LOOP_MIN_COUNT; > - range = 1UL << MINORBITS; > + max_devices = 1UL << MINORBITS; > } > > err = misc_register(&loop_misc); > @@ -2301,13 +2425,13 @@ static int __init loop_init(void) > goto misc_out; > } > > - blk_register_region(MKDEV(LOOP_MAJOR, 0), range, > + blk_register_region(MKDEV(LOOP_MAJOR, 0), max_devices, > THIS_MODULE, loop_probe, NULL, NULL); > > /* pre-create number of devices given by config or max_loop */ > mutex_lock(&loop_ctl_mutex); > for (i = 0; i < nr; i++) > - loop_add(&lo, i); > + loop_add(&lo, i, NULL); > mutex_unlock(&loop_ctl_mutex); > > printk(KERN_INFO "loop: module loaded\n"); > @@ -2329,14 +2453,10 @@ static int loop_exit_cb(int id, void *ptr, void *data) > > static void __exit loop_exit(void) > { > - unsigned long range; > - > - range = max_loop ? 
max_loop << part_shift : 1UL << MINORBITS; > - > idr_for_each(&loop_index_idr, &loop_exit_cb, NULL); > idr_destroy(&loop_index_idr); > > - blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range); > + blk_unregister_region(MKDEV(LOOP_MAJOR, 0), max_devices); > unregister_blkdev(LOOP_MAJOR, "loop"); > > misc_deregister(&loop_misc); > diff --git a/drivers/block/loop.h b/drivers/block/loop.h > index af75a5ee4094..6fed746b6124 100644 > --- a/drivers/block/loop.h > +++ b/drivers/block/loop.h > @@ -17,6 +17,10 @@ > #include <linux/kthread.h> > #include <uapi/linux/loop.h> > > +#ifdef CONFIG_BLK_DEV_LOOPFS > +#include "loopfs/loopfs.h" > +#endif > + > /* Possible states of device */ > enum { > Lo_unbound, > @@ -62,6 +66,9 @@ struct loop_device { > struct request_queue *lo_queue; > struct blk_mq_tag_set tag_set; > struct gendisk *lo_disk; > +#ifdef CONFIG_BLK_DEV_LOOPFS > + struct lo_loopfs *lo_info; > +#endif > }; > > struct loop_cmd { > @@ -89,6 +96,9 @@ struct loop_func_table { > }; > > int loop_register_transfer(struct loop_func_table *funcs); > -int loop_unregister_transfer(int number); > +int loop_unregister_transfer(int number); > +#ifdef CONFIG_BLK_DEV_LOOPFS > +extern unsigned long max_devices; > +#endif > > #endif > diff --git a/drivers/block/loopfs/Makefile b/drivers/block/loopfs/Makefile > new file mode 100644 > index 000000000000..87ec703b662e > --- /dev/null > +++ b/drivers/block/loopfs/Makefile > @@ -0,0 +1,3 @@ > +# SPDX-License-Identifier: GPL-2.0-only > +loopfs-y := loopfs.o > +obj-$(CONFIG_BLK_DEV_LOOPFS) += loopfs.o > diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c > new file mode 100644 > index 000000000000..b3461c72b6e7 > --- /dev/null > +++ b/drivers/block/loopfs/loopfs.c > @@ -0,0 +1,494 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > + > +#include <linux/fs.h> > +#include <linux/fs_parser.h> > +#include <linux/fsnotify.h> > +#include <linux/genhd.h> > +#include <linux/init.h> > +#include <linux/list.h> > +#include 
<linux/magic.h> > +#include <linux/major.h> > +#include <linux/miscdevice.h> > +#include <linux/module.h> > +#include <linux/mount.h> > +#include <linux/namei.h> > +#include <linux/sched.h> > +#include <linux/slab.h> > +#include <linux/seq_file.h> > + > +#include "../loop.h" > +#include "loopfs.h" > + > +#define FIRST_INODE 1 > +#define SECOND_INODE 2 > +#define INODE_OFFSET 3 > + > +enum loopfs_param { > + Opt_max, > +}; > + > +const struct fs_parameter_spec loopfs_fs_parameters[] = { > + fsparam_u32("max", Opt_max), > + {} > +}; > + > +struct loopfs_mount_opts { > + int max; > +}; > + > +struct loopfs_info { > + kuid_t root_uid; > + kgid_t root_gid; > + unsigned long device_count; > + struct dentry *control_dentry; > + struct loopfs_mount_opts mount_opts; > +}; > + > +static inline struct loopfs_info *LOOPFS_SB(const struct super_block *sb) > +{ > + return sb->s_fs_info; > +} > + > +struct super_block *loopfs_i_sb(const struct inode *inode) > +{ > + if (inode && inode->i_sb->s_magic == LOOPFS_SUPER_MAGIC) > + return inode->i_sb; > + > + return NULL; > +} > + > +bool loopfs_device(const struct loop_device *lo) > +{ > + return lo->lo_info != NULL; > +} > + > +struct user_namespace *loopfs_ns(const struct loop_device *lo) > +{ > + if (loopfs_device(lo)) { > + struct super_block *sb; > + > + sb = loopfs_i_sb(lo->lo_info->lo_inode); > + if (sb) > + return sb->s_user_ns; > + } > + > + return &init_user_ns; > +} > + > +bool loopfs_access(const struct inode *first, struct loop_device *lo) > +{ > + return loopfs_device(lo) && > + loopfs_i_sb(first) == loopfs_i_sb(lo->lo_info->lo_inode); > +} > + > +bool loopfs_wants_remove(const struct loop_device *lo) > +{ > + return lo->lo_info && (lo->lo_info->lo_flags & LOOPFS_FLAGS_INACTIVE); > +} > + > +/** > + * loopfs_add - allocate inode from super block of a loopfs mount > + * @lo: loop device for which we are creating a new device entry > + * @ref_inode: inode from wich the super block will be taken > + * @device_nr: device 
number of the associated disk device > + * > + * This function creates a new device node for @lo. > + * Minor numbers are limited and tracked globally. The > + * function will stash a struct loop_device for the specific loop > + * device in i_private of the inode. > + * It will go on to allocate a new inode from the super block of the > + * filesystem mount, stash a struct loop_device in its i_private field > + * and attach a dentry to that inode. > + * > + * Return: 0 on success, negative errno on failure > + */ > +int loopfs_add(struct loop_device *lo, struct inode *ref_inode, dev_t device_nr) > +{ > + int ret; > + char name[DISK_NAME_LEN]; > + struct super_block *sb; > + struct loopfs_info *info; > + struct dentry *root, *dentry; > + struct inode *inode; > + struct lo_loopfs *lo_info; > + > + sb = loopfs_i_sb(ref_inode); > + if (!sb) > + return 0; > + > + if (MAJOR(device_nr) != LOOP_MAJOR) > + return -EINVAL; > + > + lo_info = kzalloc(sizeof(struct lo_loopfs), GFP_KERNEL); > + if (!lo_info) { > + ret = -ENOMEM; > + goto err; > + } > + > + info = LOOPFS_SB(sb); > + if ((info->device_count + 1) > info->mount_opts.max) { > + ret = -ENOSPC; > + goto err; > + } > + > + lo_info->lo_ucount = inc_ucount(sb->s_user_ns, > + info->root_uid, UCOUNT_LOOP_DEVICES); > + if (!lo_info->lo_ucount) { > + ret = -ENOSPC; > + goto err; > + } > + > + if (snprintf(name, sizeof(name), "loop%d", lo->lo_number) >= sizeof(name)) { > + ret = -EINVAL; > + goto err; > + } > + > + inode = new_inode(sb); > + if (!inode) { > + ret = -ENOMEM; > + goto err; > + } > + > + /* > + * The i_fop field will be set to the correct fops by the device layer > + * when the loop device in this loopfs instance is opened. 
> + */ > + inode->i_ino = MINOR(device_nr) + INODE_OFFSET; > + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode); > + inode->i_uid = info->root_uid; > + inode->i_gid = info->root_gid; > + init_special_inode(inode, S_IFBLK | 0600, device_nr); > + > + root = sb->s_root; > + inode_lock(d_inode(root)); > + /* look it up */ > + dentry = lookup_one_len(name, root, strlen(name)); > + if (IS_ERR(dentry)) { > + inode_unlock(d_inode(root)); > + iput(inode); > + ret = PTR_ERR(dentry); > + goto err; > + } > + > + if (d_really_is_positive(dentry)) { > + /* already exists */ > + dput(dentry); > + inode_unlock(d_inode(root)); > + iput(inode); > + ret = -EEXIST; > + goto err; > + } > + > + d_instantiate(dentry, inode); > + fsnotify_create(d_inode(root), dentry); > + inode_unlock(d_inode(root)); > + > + lo_info->lo_inode = inode; > + lo->lo_info = lo_info; > + inode->i_private = lo; > + info->device_count++; > + > + return 0; > + > +err: > + if (lo_info->lo_ucount) > + dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES); > + kfree(lo_info); > + return ret; > +} > + > +void loopfs_remove(struct loop_device *lo) > +{ > + struct lo_loopfs *lo_info = lo->lo_info; > + struct inode *inode; > + struct super_block *sb; > + struct dentry *root, *dentry; > + > + if (!lo_info) > + return; > + > + inode = lo_info->lo_inode; > + if (!inode || !S_ISBLK(inode->i_mode) || imajor(inode) != LOOP_MAJOR) > + goto out; > + > + sb = loopfs_i_sb(inode); > + lo_info->lo_inode = NULL; > + > + /* > + * The root dentry is always the parent dentry since we don't allow > + * creation of directories. 
> + */ > + root = sb->s_root; > + > + inode_lock(d_inode(root)); > + dentry = d_find_any_alias(inode); > + if (dentry && simple_positive(dentry)) { > + simple_unlink(d_inode(root), dentry); > + d_delete(dentry); > + } > + dput(dentry); > + inode_unlock(d_inode(root)); > + LOOPFS_SB(sb)->device_count--; > + > +out: > + if (lo_info->lo_ucount) > + dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES); > + kfree(lo->lo_info); > + lo->lo_info = NULL; > +} > + > +static void loopfs_fs_context_free(struct fs_context *fc) > +{ > + struct loopfs_mount_opts *ctx = fc->fs_private; > + > + kfree(ctx); > +} > + > +/** > + * loopfs_loop_ctl_create - create a new loop-control device > + * @sb: super block of the loopfs mount > + * > + * This function creates a new loop-control device node in the loopfs mount > + * referred to by @sb. > + * > + * Return: 0 on success, negative errno on failure > + */ > +static int loopfs_loop_ctl_create(struct super_block *sb) > +{ > + struct dentry *dentry; > + struct inode *inode = NULL; > + struct dentry *root = sb->s_root; > + struct loopfs_info *info = sb->s_fs_info; > + > + if (info->control_dentry) > + return 0; > + > + inode = new_inode(sb); > + if (!inode) > + return -ENOMEM; > + > + inode->i_ino = SECOND_INODE; > + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode); > + init_special_inode(inode, S_IFCHR | 0600, > + MKDEV(MISC_MAJOR, LOOP_CTRL_MINOR)); > + /* > + * The i_fop field will be set to the correct fops by the device layer > + * when the loop-control device in this loopfs instance is opened. 
> +	 */
> +	inode->i_uid = info->root_uid;
> +	inode->i_gid = info->root_gid;
> +
> +	dentry = d_alloc_name(root, "loop-control");
> +	if (!dentry) {
> +		iput(inode);
> +		return -ENOMEM;
> +	}
> +
> +	info->control_dentry = dentry;
> +	d_add(dentry, inode);
> +
> +	return 0;
> +}
> +
> +static inline bool is_loopfs_control_device(const struct dentry *dentry)
> +{
> +	return LOOPFS_SB(dentry->d_sb)->control_dentry == dentry;
> +}
> +
> +static int loopfs_rename(struct inode *old_dir, struct dentry *old_dentry,
> +			 struct inode *new_dir, struct dentry *new_dentry,
> +			 unsigned int flags)
> +{
> +	if (is_loopfs_control_device(old_dentry) ||
> +	    is_loopfs_control_device(new_dentry))
> +		return -EPERM;
> +
> +	return simple_rename(old_dir, old_dentry, new_dir, new_dentry, flags);
> +}
> +
> +static int loopfs_unlink(struct inode *dir, struct dentry *dentry)
> +{
> +	int ret;
> +	struct loop_device *lo;
> +
> +	if (is_loopfs_control_device(dentry))
> +		return -EPERM;
> +
> +	lo = d_inode(dentry)->i_private;
> +	ret = loopfs_rundown_locked(lo);
> +	if (ret)
> +		return ret;
> +
> +	return simple_unlink(dir, dentry);
> +}
> +
> +static const struct inode_operations loopfs_dir_inode_operations = {
> +	.lookup = simple_lookup,
> +	.rename = loopfs_rename,
> +	.unlink = loopfs_unlink,
> +};
> +
> +static void loopfs_evict_inode(struct inode *inode)
> +{
> +	struct loop_device *lo = inode->i_private;
> +
> +	clear_inode(inode);
> +
> +	if (lo && S_ISBLK(inode->i_mode) && imajor(inode) == LOOP_MAJOR) {
> +		loopfs_evict_locked(lo);
> +		LOOPFS_SB(inode->i_sb)->device_count--;
> +		inode->i_private = NULL;
> +	}
> +}
> +
> +static int loopfs_show_options(struct seq_file *seq, struct dentry *root)
> +{
> +	struct loopfs_info *info = LOOPFS_SB(root->d_sb);
> +
> +	if (info->mount_opts.max <= max_devices)
> +		seq_printf(seq, ",max=%d", info->mount_opts.max);
> +
> +	return 0;
> +}
> +
> +static void loopfs_put_super(struct super_block *sb)
> +{
> +	struct loopfs_info *info = sb->s_fs_info;
> +
> +	sb->s_fs_info = NULL;
> +	kfree(info);
> +}
> +
> +static const struct super_operations loopfs_super_ops = {
> +	.evict_inode = loopfs_evict_inode,
> +	.show_options = loopfs_show_options,
> +	.statfs = simple_statfs,
> +	.put_super = loopfs_put_super,
> +};
> +
> +static int loopfs_fill_super(struct super_block *sb, struct fs_context *fc)
> +{
> +	struct loopfs_info *info;
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +	struct inode *inode = NULL;
> +
> +	sb->s_blocksize = PAGE_SIZE;
> +	sb->s_blocksize_bits = PAGE_SHIFT;
> +
> +	sb->s_iflags &= ~SB_I_NODEV;
> +	sb->s_iflags |= SB_I_NOEXEC;
> +	sb->s_magic = LOOPFS_SUPER_MAGIC;
> +	sb->s_op = &loopfs_super_ops;
> +	sb->s_time_gran = 1;
> +
> +	sb->s_fs_info = kzalloc(sizeof(struct loopfs_info), GFP_KERNEL);
> +	if (!sb->s_fs_info)
> +		return -ENOMEM;
> +	info = sb->s_fs_info;
> +
> +	info->root_gid = make_kgid(sb->s_user_ns, 0);
> +	if (!gid_valid(info->root_gid))
> +		info->root_gid = GLOBAL_ROOT_GID;
> +	info->root_uid = make_kuid(sb->s_user_ns, 0);
> +	if (!uid_valid(info->root_uid))
> +		info->root_uid = GLOBAL_ROOT_UID;
> +	info->mount_opts.max = ctx->max;
> +
> +	inode = new_inode(sb);
> +	if (!inode)
> +		return -ENOMEM;
> +
> +	inode->i_ino = FIRST_INODE;
> +	inode->i_fop = &simple_dir_operations;
> +	inode->i_mode = S_IFDIR | 0755;
> +	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> +	inode->i_op = &loopfs_dir_inode_operations;
> +	set_nlink(inode, 2);
> +
> +	sb->s_root = d_make_root(inode);
> +	if (!sb->s_root)
> +		return -ENOMEM;
> +
> +	return loopfs_loop_ctl_create(sb);
> +}
> +
> +static int loopfs_fs_context_get_tree(struct fs_context *fc)
> +{
> +	return get_tree_nodev(fc, loopfs_fill_super);
> +}
> +
> +static int loopfs_fs_context_parse_param(struct fs_context *fc,
> +					 struct fs_parameter *param)
> +{
> +	int opt;
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +	struct fs_parse_result result;
> +
> +	opt = fs_parse(fc, loopfs_fs_parameters, param, &result);
> +	if (opt < 0)
> +		return opt;
> +
> +	switch (opt) {
> +	case Opt_max:
> +		if (result.uint_32 > max_devices)
> +			return invalfc(fc, "Bad value for '%s'", param->key);
> +
> +		ctx->max = result.uint_32;
> +		break;
> +	default:
> +		return invalfc(fc, "Unsupported parameter '%s'", param->key);
> +	}
> +
> +	return 0;
> +}
> +
> +static int loopfs_fs_context_reconfigure(struct fs_context *fc)
> +{
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +	struct loopfs_info *info = LOOPFS_SB(fc->root->d_sb);
> +
> +	info->mount_opts.max = ctx->max;
> +	return 0;
> +}
> +
> +static const struct fs_context_operations loopfs_fs_context_ops = {
> +	.free = loopfs_fs_context_free,
> +	.get_tree = loopfs_fs_context_get_tree,
> +	.parse_param = loopfs_fs_context_parse_param,
> +	.reconfigure = loopfs_fs_context_reconfigure,
> +};
> +
> +static int loopfs_init_fs_context(struct fs_context *fc)
> +{
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +
> +	ctx = kzalloc(sizeof(struct loopfs_mount_opts), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	ctx->max = max_devices;
> +
> +	fc->fs_private = ctx;
> +
> +	fc->ops = &loopfs_fs_context_ops;
> +
> +	return 0;
> +}
> +
> +static struct file_system_type loop_fs_type = {
> +	.name = "loop",
> +	.init_fs_context = loopfs_init_fs_context,
> +	.parameters = loopfs_fs_parameters,
> +	.kill_sb = kill_litter_super,
> +	.fs_flags = FS_USERNS_MOUNT,
> +};
> +
> +int __init init_loopfs(void)
> +{
> +	init_user_ns.ucount_max[UCOUNT_LOOP_DEVICES] = 255;
> +	return register_filesystem(&loop_fs_type);
> +}
> +
> +module_init(init_loopfs);
> +MODULE_AUTHOR("Christian Brauner <christian.brauner@xxxxxxxxxx>");
> +MODULE_DESCRIPTION("Loop device filesystem");
> diff --git a/drivers/block/loopfs/loopfs.h b/drivers/block/loopfs/loopfs.h
> new file mode 100644
> index 000000000000..2ee114aa3fa9
> --- /dev/null
> +++ b/drivers/block/loopfs/loopfs.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_LOOPFS_FS_H
> +#define _LINUX_LOOPFS_FS_H
> +
> +#include <linux/errno.h>
> +#include <linux/fs.h>
> +#include <linux/magic.h>
> +#include <linux/user_namespace.h>
> +
> +struct loop_device;
> +
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +
> +#define LOOPFS_FLAGS_INACTIVE (1 << 0)
> +
> +struct lo_loopfs {
> +	struct ucounts *lo_ucount;
> +	struct inode *lo_inode;
> +	int lo_flags;
> +};
> +
> +extern struct super_block *loopfs_i_sb(const struct inode *inode);
> +extern bool loopfs_device(const struct loop_device *lo);
> +extern struct user_namespace *loopfs_ns(const struct loop_device *lo);
> +extern bool loopfs_access(const struct inode *first, struct loop_device *lo);
> +extern int loopfs_add(struct loop_device *lo, struct inode *ref_inode,
> +		      dev_t device_nr);
> +extern void loopfs_remove(struct loop_device *lo);
> +extern bool loopfs_wants_remove(const struct loop_device *lo);
> +extern void loopfs_evict_locked(struct loop_device *lo);
> +extern int loopfs_rundown_locked(struct loop_device *lo);
> +
> +#endif
> +
> +#endif /* _LINUX_LOOPFS_FS_H */
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 6ef1c7109fc4..04a4891765c0 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -49,6 +49,9 @@ enum ucount_type {
>  #ifdef CONFIG_INOTIFY_USER
>  	UCOUNT_INOTIFY_INSTANCES,
>  	UCOUNT_INOTIFY_WATCHES,
> +#endif
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	UCOUNT_LOOP_DEVICES,
>  #endif
>  	UCOUNT_COUNTS,
>  };
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index d78064007b17..0817d093a012 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -75,6 +75,7 @@
>  #define BINFMTFS_MAGIC	0x42494e4d
>  #define DEVPTS_SUPER_MAGIC	0x1cd1
>  #define BINDERFS_SUPER_MAGIC	0x6c6f6f70
> +#define LOOPFS_SUPER_MAGIC	0x6c6f6f71
>  #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
>  #define PIPEFS_MAGIC	0x50495045
>  #define PROC_SUPER_MAGIC	0x9fa0
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 11b1596e2542..fb0f6394a8bb 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -73,6 +73,9 @@ static struct ctl_table user_table[] = {
>  #ifdef CONFIG_INOTIFY_USER
>  	UCOUNT_ENTRY("max_inotify_instances"),
>  	UCOUNT_ENTRY("max_inotify_watches"),
> +#endif
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	UCOUNT_ENTRY("max_loop_devices"),
>  #endif
>  	{ }
>  };
> -- 
> 2.26.1