On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner <christian.brauner@xxxxxxxxxx> wrote:
> One of the use-cases for loopfs is to allow to dynamically allocate loop
> devices in sandboxed workloads without exposing /dev or
> /dev/loop-control to the workload in question and without having to
> implement a complex and also racy protocol to send around file
> descriptors for loop devices. With loopfs each mount is a new instance,
> i.e. loop devices created in one loopfs instance are independent of any
> loop devices created in another loopfs instance. This allows
> sufficiently privileged tools to have their own private stash of loop
> device instances. Dmitry has expressed his desire to use this for
> syzkaller in a private discussion. And various parties that want to use
> it are Cced here too.
>
> In addition, the loopfs filesystem can be mounted by user namespace root
> and is thus suitable for use in containers. Combined with syscall
> interception this makes it possible to securely delegate mounting of
> images on loop devices, i.e. when a user calls mount -o loop <image>
> <mountpoint> it will be possible to completely setup the loop device.
> The final mount syscall to actually perform the mount will be handled
> through syscall interception and be performed by a sufficiently
> privileged process. Syscall interception is already supported through a
> new seccomp feature we implemented in [1] and extended in [2] and is
> actively used in production workloads. The additional loopfs work will
> be used there and in various other workloads too. You'll find a short
> illustration how this works with syscall interception below in [4].

Would that privileged process then allow you to mount your filesystem images with things like ext4?
As far as I know, the filesystem maintainers don't generally consider "untrusted filesystem image" to be a strongly enforced security boundary; and worse, if an attacker has access to a loop device from which something like ext4 is mounted, things like "struct ext4_dir_entry_2" will effectively be in shared memory, and an attacker can trivially bypass e.g. ext4_check_dir_entry(). At the moment, that's not a huge problem (for anything other than kernel lockdown) because only root normally has access to loop devices.

Ubuntu carries an out-of-tree patch that afaik blocks the shared memory thing:
<https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>

But even with that patch, I'm not super excited about exposing filesystem image parsing attack surface to containers unless you run the filesystem in a sandboxed environment (at which point you don't need a loop device anymore either).
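
To make the "shared memory" concern a bit more concrete, here is a rough userspace model of the check-then-use problem. It is not kernel code and not how ext4 is implemented: the 8-byte header only mirrors the leading fields of struct ext4_dir_entry_2 (inode, rec_len, name_len, file_type), the checks are a simplified stand-in for ext4_check_dir_entry(), and all function names (make_entry, check_entry, racy_lookup, ...) are made up for this sketch. The concurrent writer is simulated by a callback that runs between the check and the use.

```python
import struct

# inode (le32), rec_len (le16), name_len (u8), file_type (u8): the leading
# fields of struct ext4_dir_entry_2. Everything else here is hypothetical.
HEADER = struct.Struct("<IHBB")

def make_entry(inode, name, file_type=1):
    rec_len = (HEADER.size + len(name) + 3) & ~3  # round up to a multiple of 4
    raw = HEADER.pack(inode, rec_len, len(name), file_type) + name
    return raw.ljust(rec_len, b"\0")

def check_entry(buf, off):
    """Simplified stand-in for the sanity checks in ext4_check_dir_entry()."""
    if off + HEADER.size > len(buf):
        return False
    _, rec_len, name_len, _ = HEADER.unpack_from(buf, off)
    return (rec_len >= HEADER.size + name_len
            and rec_len % 4 == 0
            and off + rec_len <= len(buf))

def read_name(buf, off):
    """Trusts name_len, the way code running after validation would."""
    _, _, name_len, _ = HEADER.unpack_from(buf, off)
    start = off + HEADER.size
    return bytes(buf[start:start + name_len])

def racy_lookup(shared, tamper):
    # Validate and use the same attacker-writable buffer.
    if not check_entry(shared, 0):
        raise ValueError("corrupt directory entry")
    tamper(shared)  # simulates a write that lands between check and use
    return read_name(shared, 0)

def snapshot_lookup(shared, tamper):
    # Copy first, then validate and use only the private copy.
    private = bytes(shared)
    if not check_entry(private, 0):
        raise ValueError("corrupt directory entry")
    tamper(shared)  # the later write no longer affects what we parse
    return read_name(private, 0)

def grow_name_len(buf):
    buf[6] = 16  # byte 6 is name_len; claim a longer name than was validated

if __name__ == "__main__":
    image = make_entry(12, b"file") + b"SECRET-DATA!"
    print(racy_lookup(bytearray(image), grow_name_len))
    print(snapshot_lookup(bytearray(image), grow_name_len))
```

In the racy variant the name read runs past the validated record into the adjacent bytes; the snapshot variant only ever parses data that can no longer change under it. That is just an illustration of the bug class, not a claim about what the Ubuntu patch does.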