Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay

On Tue, Mar 07, 2023 at 08:33:29PM +0100, Giuseppe Scrivano wrote:
> Christian Brauner <brauner@xxxxxxxxxx> writes:
> 
> > On Tue, Mar 07, 2023 at 01:09:57PM +0100, Alexander Larsson wrote:
> >> On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >> >
> >> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> >> > > Hi Alexander,
> >> > >
> >> > > On 2023/3/3 21:57, Alexander Larsson wrote:
> >> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
> >> 
> >> > > > But I know for the people who are more interested in using composefs
> >> > > > for containers the eventual goal of rootless support is very
> >> > > > important. So, on behalf of them I guess the question is: Is there
> >> > > > ever any chance that something like composefs could work rootlessly?
> >> > > > Or conversely: Is there some way to get rootless support from the
> >> > > > overlay approach? Opinions? Ideas?
> >> > >
> >> > > Honestly, I do want to get a proper answer when Giuseppe asked me
> >> > > the same question.  My current view is simply "that question is
> >> > > almost the same for all in-kernel fses with some on-disk format".
> >> >
> >> > As far as I'm concerned filesystems with on-disk format will not be made
> >> > mountable by unprivileged containers. And I don't think I'm alone in
> >> > that view. The idea that ever more parts of the kernel with a massive
> >> > attack surface such as a filesystem need to vouch for the safety in
> >> > the face of every rando having access to
> >> > unshare --mount --user --map-root is a dead end and will just end up
> >> > trapping us in a neverending cycle of security bugs (Because every
> >> > single bug that's found after making that fs mountable from an
> >> > unprivileged container will be treated as a security bug no matter if
> >> > justified or not. So this is also a good way to ruin your filesystem's
> >> > reputation.).
> >> >
> >> > And honestly, if we set the precedent that it's fine for one filesystem
> >> > with an on-disk format to be able to be mounted by unprivileged
> >> > containers then other filesystems eventually want to do this as well.
> >> >
> >> > At the rate we currently add filesystems that's just a matter of time
> >> > even if none of the existing ones would also want to do it. And then
> >> > we're left arguing that this was just an exception for one super
> >> > special, super safe, unexploitable filesystem with an on-disk format.
> >> >
> >> > Imho, none of this is appealing. I don't want to slowly keep building a
> >> > future where we end up running fuzzers in unprivileged containers to
> >> > generate random images to crash the kernel.
> >> >
> >> > I have more arguments why I don't think this is a path we will ever go down
> >> > but I don't want this to detract from the legitimate ask of making it
> >> > possible to mount trusted images from within unprivileged containers.
> >> > Because I think that's perfectly legitimate.
> >> >
> >> > However, I don't think that this is something the kernel needs to solve
> >> > other than providing the necessary infrastructure so that this can be
> >> > solved in userspace.
> >> 
> >> So, I completely understand this point of view. And, since I'm not
> >> really hearing any other viewpoint from the linux vfs developers it
> >> seems to be a shared opinion. So, it seems like further work on the
> >> kernel side of composefs isn't really useful anymore, and I will focus
> >> my work on the overlayfs side. Maybe we can even drop the summit topic
> >> to avoid a bunch of unnecessary travel?
> >> 
> >> That said, even though I understand (and even agree) with your
> >> worries, I feel it is kind of unfortunate that we end up with
> >> (essentially) a setuid helper approach for this. Because it feels like
> >> we're giving up on a useful feature (trustless unprivileged mounts)
> >> that the kernel could *theoretically* deliver, but a setuid helper
> >> can't. Sure, if you have a closed system you can limit what images can
> >> get mounted to images signed by a trusted key, but it won't work well
> >> for things like user built images or publicly available images.
> >> Unfortunately practicalities kinda outweigh theoretical advantages.
> >
> > Characterizing this as a setuid helper approach feels a bit like negative
> > branding. :)
> >
> > But just in case there's a misunderstanding of any form let me clarify
> > that systemd doesn't produce set*id binaries in any form; never has,
> > never will.
> >
> > It's also good to remember that in order to even use unprivileged
> > containers with meaningful idmappings __two__ set*id binaries -
> > new*idmap - with an extremely clunky, and frankly unusable id delegation
> > policy expressed through these weird /etc/sub*id files have to be used.
> > Which apparently everyone is happy to use.
> >
> > What we're talking about here however is a first class system service
> > capable of expressing meaningful security policy (e.g., image signed by
> > a key in the kernel keyring, polkit, ...). And such well-scoped local
> > services are a good thing.
> 
> there are some disadvantages too:
> 
> - while the impact on system services is negligible, using the proposed
>   approach could slow down container startup.
>   It is somehow similar to the issue we currently have with cgroups,
>   where manually creating a cgroup is faster than going through dbus and
>   systemd.  IMHO, the kernel could easily verify the image signature

This will use varlink; dbus would be optional and only be involved if
a service wanted to use polkit for trust. Signatures would be the main
way to establish trust. Efficiency is of course at the forefront.

That said, note that big chunks of mounting are already serialized on
the namespace lock (mount propagation et al.) and the mount lock
(properties, parent-child relationships, mountpoints etc.), so it's not
as if this is a particularly fast operation to begin with.

Mounting is expensive in the kernel especially with mount propagation in
the mix. If you have a thousand containers all calling mount at the same
time with mount propagation between them for a big mount tree that'll be
costly. IOW, the cost for mounting isn't paid in userspace.

>   without relying on an additional userland service when mounting it
>   from a user namespace.
> 
> - it won't be usable from a containerized build system.  It is common to
>   build container images inside of a container (so that they can be
>   built in a cluster).  To use the systemd approach, we'll need to
>   access systemd on the host from the container.

I don't see why that would be a problem; in fact, I consider it the
proper design. And I've explained in the earlier mail that we have
nesting in mind from the start.

You've mentioned the cgroup delegation model above, and it is a good
example. The whole shtick of pressure stall information (PSI), for
example for the memory controller, is the realization that instead of
pushing the policy about how to handle memory pressure ever deeper into
the kernel it's better to expose the necessary infrastructure to
userspace, which can then implement policies tailored to the workload.
The kernel isn't suited for expressing such fine-grained policies. And
eBPF for containers will end up being managed in a similar way with a
system service that implements the policy for attaching eBPF programs to
containers.
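
To make the PSI point concrete, here is a minimal sketch of that split:
the kernel only exports the numbers in /proc/pressure/memory, and the
policy lives in userspace. The 10% threshold, the under_pressure()
helper and the sample values are made up for illustration; a real
service would pick its own policy (and would likely use PSI triggers
rather than polling).

```python
from dataclasses import dataclass


@dataclass
class PsiLine:
    kind: str      # "some" or "full"
    avg10: float   # % of time stalled over the last 10 seconds
    avg60: float
    avg300: float
    total: int     # cumulative stall time in microseconds


def parse_psi(text: str) -> dict[str, PsiLine]:
    """Parse the contents of a /proc/pressure/<resource> file."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = PsiLine(
            kind,
            float(values["avg10"]),
            float(values["avg60"]),
            float(values["avg300"]),
            int(values["total"]),
        )
    return result


def read_memory_pressure(path: str = "/proc/pressure/memory") -> dict[str, PsiLine]:
    """Read the live values; needs a kernel built with PSI support."""
    with open(path) as f:
        return parse_psi(f.read())


def under_pressure(psi: dict[str, PsiLine], threshold: float = 10.0) -> bool:
    """Hypothetical policy: react once 'some' avg10 crosses the threshold."""
    some = psi.get("some")
    return some is not None and some.avg10 >= threshold


# Made-up sample in the format the kernel exports.
SAMPLE = (
    "some avg10=12.50 avg60=3.10 avg300=0.80 total=123456\n"
    "full avg10=1.00 avg60=0.20 avg300=0.05 total=6543\n"
)
```

A different workload would simply swap out under_pressure() without
the kernel having to know anything about it.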

The mounting of filesystem images, network filesystems and so on is imho
a similar problem. The policy for when a filesystem mount should be
allowed is something that at the end of the day belongs in a userspace
system-level service. The use-cases are just too many, and the
filesystems too distinct and too complex, to be covered by the kernel.
Another advantage is that with a system-level service we can extend
this ability to all filesystems at once and to regular users on the
system.

In order to give the security and resource guarantees that a modern
system needs, the various services need to integrate with one another,
and that may involve asking for privileged operations to be performed.


