On Wed, Apr 8, 2020 at 6:41 PM Stéphane Graber <stgraber@xxxxxxxxxx> wrote:
>
> On Wed, Apr 8, 2020 at 12:24 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
> > <christian.brauner@xxxxxxxxxx> wrote:
> > > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > > devices in sandboxed workloads without exposing /dev or
> > > /dev/loop-control to the workload in question and without having to
> > > implement a complex and also racy protocol to send around file
> > > descriptors for loop devices. With loopfs each mount is a new instance,
> > > i.e. loop devices created in one loopfs instance are independent of any
> > > loop devices created in another loopfs instance. This allows
> > > sufficiently privileged tools to have their own private stash of loop
> > > device instances. Dmitry has expressed his desire to use this for
> > > syzkaller in a private discussion. And various parties that want to use
> > > it are Cced here too.
> > >
> > > In addition, the loopfs filesystem can be mounted by user namespace root
> > > and is thus suitable for use in containers. Combined with syscall
> > > interception this makes it possible to securely delegate mounting of
> > > images on loop devices, i.e. when a user calls mount -o loop <image>
> > > <mountpoint> it will be possible to completely setup the loop device.
> > > The final mount syscall to actually perform the mount will be handled
> > > through syscall interception and be performed by a sufficiently
> > > privileged process. Syscall interception is already supported through a
> > > new seccomp feature we implemented in [1] and extended in [2] and is
> > > actively used in production workloads. The additional loopfs work will
> > > be used there and in various other workloads too. You'll find a short
> > > illustration how this works with syscall interception below in [4].
> >
> > Would that privileged process then allow you to mount your filesystem
> > images with things like ext4? As far as I know, the filesystem
> > maintainers don't generally consider "untrusted filesystem image" to
> > be a strongly enforced security boundary; and worse, if an attacker
> > has access to a loop device from which something like ext4 is mounted,
> > things like "struct ext4_dir_entry_2" will effectively be in shared
> > memory, and an attacker can trivially bypass e.g.
> > ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> > anything other than kernel lockdown) because only root normally has
> > access to loop devices.
> >
> > Ubuntu carries an out-of-tree patch that afaik blocks the shared
> > memory thing: <https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>
> >
> > But even with that patch, I'm not super excited about exposing
> > filesystem image parsing attack surface to containers unless you run
> > the filesystem in a sandboxed environment (at which point you don't
> > need a loop device anymore either).
>
> So in general we certainly agree that you should never expose someone
> that you wouldn't trust with root on the host to syscall interception
> mounting of real kernel filesystems.
>
> But that's not all that our syscall interception logic can do. We have
> support for rewriting a normal filesystem mount attempt to instead use
> an available FUSE implementation. As far as the user is concerned,
> they ran "mount /dev/sdaX /mnt" and got that ext4 filesystem mounted
> on /mnt as requested, except that the container manager intercepted
> the mount attempt and instead spawned fuse2fs for that mount. This
> requires absolutely no change to the software the user is running.
>
> loopfs, with that interception mode, will let us also handle all cases
> where a loop would be used, similarly without needing any change to
> the software being run. If a piece of software calls the command
> "mount -o loop blah.img /mnt", the "mount" command will setup a loop
> device as it normally would (doing so through loopfs) and then will
> call the "mount" syscall, which will get intercepted and redirected to
> a FUSE implementation if so configured, resulting in the expected
> filesystem being mounted for the user.
>
> LXD with syscall interception offers both straight up privileged
> mounting using the kernel fs or using a FUSE based implementation.
> This is configurable on a per-filesystem and per-container basis.
>
> I hope that clarifies what we're doing here :)
>
> Stéphane

Hi Christian,

Our use case for loopfs in syzkaller would be isolation of several
test processes from each other.
Currently all loop devices and loop-control are global and cause test
processes to collide, which in turn causes non-reproducible coverage
and non-reproducible crashes. Ideally we give each test process its
own loopfs instance.
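
For readers following the thread, the per-filesystem, per-container
choice Stéphane describes (kernel mount vs. FUSE redirect) could be
sketched roughly as below. This is only an illustration of the
decision logic, not LXD's actual code: ContainerPolicy, MountAction,
decide_mount and the rule that a configured FUSE helper takes
precedence over a kernel mount are all assumptions made for this
sketch. Only the ext4/fuse2fs pairing comes from the thread.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set, Tuple


class MountAction(Enum):
    KERNEL = "kernel"  # privileged process performs the real kernel mount
    FUSE = "fuse"      # mount is rewritten to a FUSE helper
    DENY = "deny"      # intercepted mount is refused


@dataclass
class ContainerPolicy:
    """Hypothetical per-container configuration (not LXD's real schema)."""
    # Filesystem types the manager may mount with the in-kernel driver.
    kernel_allowed: Set[str] = field(default_factory=set)
    # Filesystem type -> FUSE helper binary to spawn instead.
    fuse_helpers: Dict[str, str] = field(default_factory=dict)


def decide_mount(policy: ContainerPolicy, fstype: str) -> Tuple[MountAction, Optional[str]]:
    """Decide how an intercepted mount(2) for `fstype` should be handled.

    Assumption for this sketch: a configured FUSE helper wins over a
    kernel mount, since it keeps image parsing out of the kernel.
    """
    helper = policy.fuse_helpers.get(fstype)
    if helper is not None:
        return MountAction.FUSE, helper
    if fstype in policy.kernel_allowed:
        return MountAction.KERNEL, None
    return MountAction.DENY, None


# Example: an untrusted container gets ext4 only through fuse2fs, while
# a trusted one is allowed the kernel ext4 driver.
untrusted = ContainerPolicy(fuse_helpers={"ext4": "fuse2fs"})
trusted = ContainerPolicy(kernel_allowed={"ext4"})
```

With a policy like this, the untrusted container never reaches the
kernel's ext4 parser, which is the concern Jann raised above.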