On Tue, Mar 07, 2023 at 08:38:58AM -0500, Jeff Layton wrote:
> On Tue, 2023-03-07 at 11:15 +0100, Christian Brauner wrote:
> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> > > Hi Alexander,
> > >
> > > On 2023/3/3 21:57, Alexander Larsson wrote:
> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > Recently Giuseppe Scrivano and I have worked on [1] and proposed [2] the Composefs filesystem. It is an opportunistically sharing, validating, image-based filesystem, targeting use cases like validated ostree rootfses, validated container images that share common files, as well as other image-based use cases.
> > > > >
> > > > > During the discussions of the composefs proposal (as seen on LWN [3]) it has been proposed that (with some changes to overlayfs) similar behaviour can be achieved by combining the overlayfs "overlay.redirect" xattr with a read-only filesystem such as erofs.
> > > > >
> > > > > There are pros and cons to both of these approaches, and the discussion about their respective value has sometimes been heated. We would like to have an in-person discussion at the summit, ideally also involving more of the filesystem development community, so that we can reach some consensus on what is the best approach.
> > > >
> > > > In order to better understand the behaviour and requirements of the overlayfs+erofs approach I spent some time implementing direct support for erofs in libcomposefs. So, with the current HEAD of github.com/containers/composefs you can now do:
> > > >
> > > > $ mkcomposefs --digest-store=objects --format=erofs source-dir image.erofs
> > >
> > > Thank you for taking the time to work on EROFS support. I don't have time to play with it yet, since I'd like to get erofs-utils 1.6 out these days and will then work on some new stuff such as !pagesize block sizes, as I said previously.
> > >
> > > > This will produce an object store with the backing files, and an erofs file with the required overlayfs xattrs, including a made-up one called "overlay.fs-verity" containing the expected fs-verity digest for the lower dir. It also adds the required whiteouts to cover the 00-ff dirs from the lower dir.
> > > >
> > > > These erofs files are ordered similarly to the composefs files, and we give similar guarantees about their reproducibility, etc. So, they should be apples-to-apples comparable with the composefs images.
> > > >
> > > > Given this, I ran another set of performance tests on the original cs9 rootfs dataset, again measuring the time of `ls -lR`. I also tried to measure the memory use like this:
> > > >
> > > > # echo 3 > /proc/sys/vm/drop_caches
> > > > # systemd-run --scope sh -c 'ls -lR mountpoint > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> > > >
> > > > These are the alternatives I tried:
> > > >
> > > > xfs:       the source of the image, regular dir on xfs
> > > > erofs:     the image.erofs above, on loopback
> > > > erofs dio: the image.erofs above, on loopback with --direct-io=on
> > > > ovl:       erofs above combined with overlayfs
> > > > ovl dio:   erofs dio above combined with overlayfs
> > > > cfs:       composefs mount of image.cfs
> > > >
> > > > All tests use the same objects dir, stored on xfs. The erofs and overlay implementations are from a stock 6.1.13 kernel, and the composefs module is from the GitHub HEAD.
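(For reference, the "ovl"/"ovl dio" stacking above presumably boils down to a read-only overlayfs mount whose lowerdir chain is the loop-mounted erofs image followed by the shared objects dir. Below is a minimal sketch of that in C; the paths and the exact overlayfs options are illustrative guesses rather than the actual test setup, and the dio variants would additionally enable direct I/O on the loop device via losetup --direct-io=on / the LOOP_SET_DIRECT_IO ioctl.)

/*
 * Illustrative only: stack a read-only overlayfs on top of an erofs
 * image that is already loop-mounted at /mnt/erofs, with the backing
 * object store at /objects on the host filesystem.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        const char *opts = "redirect_dir=on,metacopy=on,"
                           "lowerdir=/mnt/erofs:/objects";

        if (mount("overlay", "/mnt/ovl", "overlay", MS_RDONLY, opts) < 0) {
                perror("mount overlay");
                return 1;
        }
        return 0;
}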
> > > > I tried loopback both with and without the direct-io option, because without direct-io enabled the kernel will double-cache the loopbacked data, as per [1].
> > > >
> > > > The produced images are:
> > > >
> > > >  8.9M image.cfs
> > > > 11.3M image.erofs
> > > >
> > > > And the results are:
> > > >
> > > >            | Cold cache | Warm cache | Mem use
> > > >            | (msec)     | (msec)     | (MB)
> > > > -----------+------------+------------+---------
> > > > xfs        |       1449 |        442 |      54
> > > > erofs      |        700 |        391 |      45
> > > > erofs dio  |        939 |        400 |      45
> > > > ovl        |       1827 |        530 |     130
> > > > ovl dio    |       2156 |        531 |     130
> > > > cfs        |        689 |        389 |      51
> > > >
> > > > I also ran the same tests in a VM that had the latest kernel including the lazyfollow patches ("ovl lazy" in the table, not using direct-io), this one ext4 based:
> > > >
> > > >            | Cold cache | Warm cache | Mem use
> > > >            | (msec)     | (msec)     | (MB)
> > > > -----------+------------+------------+---------
> > > > ext4       |       1135 |        394 |      54
> > > > erofs      |        715 |        401 |      46
> > > > erofs dio  |        922 |        401 |      45
> > > > ovl        |       1412 |        515 |     148
> > > > ovl dio    |       1810 |        532 |     149
> > > > ovl lazy   |       1063 |        523 |      87
> > > > cfs        |        719 |        463 |      51
> > > >
> > > > Things noticeable in the results:
> > > >
> > > > * composefs and erofs (by itself) perform roughly the same. This is not necessarily news, and results from Jingbo Xu match this.
> > > >
> > > > * Erofs on top of a direct-io enabled loopback device causes quite a drop in performance, which I don't really understand, especially since it reports the same memory use as without direct-io. I guess the double-caching in the latter case isn't properly attributed to the cgroup, so the difference is not measured. However, why would the double cache improve performance? Maybe I'm not completely understanding how these things interact.
> > >
> > > We've already analysed this: the root cause is that composefs uses kernel_read() to read its metadata, so irrelevant metadata (such as dir data) is read in together. Such heuristic readahead is unusual for local fses (obviously almost all in-kernel filesystems don't use kernel_read() to read their metadata; although some filesystems may readahead related extent metadata when reading an inode, they at least do _not_ work like kernel_read()). But double caching will introduce almost the same impact as kernel_read() (assuming you have read some of the loop device source code).
> > >
> > > I do hope you have already read Jingbo's latest test results, which show how badly readahead performs if fs metadata is only partially, randomly used (stat < 1500 files):
> > > https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@xxxxxxxxxxxxxxxxx
> > >
> > > Also, you could explicitly _disable_ readahead for the composefs manifest file (since all EROFS metadata reads are done without readahead) and see how it performs then.
> > >
> > > Again, if your workload is just "ls -lR", my answer is "just async readahead the whole manifest file / loop device" when mounting. That will give the best result for you. But I'm not sure that is the real use case you propose.
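(The "just async readahead the whole manifest file / loop device when mounting" suggestion is easy to prototype from userspace. A minimal sketch, reusing the image.erofs name from above; whether this helps workloads that only touch part of the metadata is exactly the open question:)

/*
 * Minimal sketch: ask the kernel to pull the whole image into the page
 * cache before mounting it, so that later metadata lookups are cache
 * hits.  POSIX_FADV_WILLNEED initiates nonblocking readahead; len == 0
 * means "to the end of the file".
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("image.erofs", O_RDONLY);
        int ret;

        if (fd < 0) {
                perror("image.erofs");
                return 1;
        }
        ret = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        if (ret)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));
        close(fd);
        return ret ? 1 : 0;
}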
> > > > * Stacking overlayfs on top of erofs adds about 100 msec to the warm-cache times compared to all non-overlay approaches, and much more in the cold-cache case. The cold-cache performance is helped significantly by the lazyfollow patches, but the warm-cache overhead remains.
> > > >
> > > > * The use of overlayfs more than doubles memory use, probably because of all the extra inodes and dentries in action for the various layers. The lazyfollow patches help, but only partially.
> > > >
> > > > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is not that much slower (~25%) than the pure xfs/ext4 directory, which is a pretty good baseline for comparisons. It is even faster when using lazyfollow on ext4.
> > > >
> > > > * The erofs images are slightly larger than the equivalent composefs image.
> > > >
> > > > In summary: the performance of composefs is somewhat better than the best erofs+ovl combination, although the overlay approach is not significantly worse than the baseline of a regular directory, except that it uses a bit more memory.
> > > >
> > > > On top of the above purely performance-based comparisons, I would like to re-state some of the other advantages of composefs compared to the overlay approach:
> > > >
> > > > * composefs is namespaceable, in the sense that you can use it (given mount capabilities) inside a namespace (such as a container) without access to non-namespaced resources like loopback or device-mapper devices. (There was work on fixing this with loopfs, but that seems to have stalled.)
> > > >
> > > > * While it is not in the current design, the simplicity of the format and the lack of loopback make it at least theoretically possible that composefs could be made usable in a rootless fashion at some point in the future.
> > >
> > > Have you considered sending some commands to /dev/cachefiles to configure a daemonless dir and mounting the erofs image directly using "erofs over fscache", but in a daemonless way? That is ongoing work on our side.
> > >
> > > IMHO, file-based interfaces are not all that charming. Historically the practice has been to "avoid directly reading files in the kernel", which I think is why almost all local fses don't work on files directly and loopback devices are the way these use cases are handled. If loopback devices are not okay for you, how about improving loopback devices instead? That would benefit almost all local fses.
> > >
> > > > And of course, there are disadvantages to composefs too, primarily that it is more code, increasing the maintenance burden and the risk of security problems. Composefs is particularly burdensome because it is a stacking filesystem, and these have historically been shown to be hard to get right.
> > > >
> > > > The question now is: what is the best approach overall? For my own primary use case of making a verifying ostree root filesystem, the overlay approach (with the lazyfollow work finished) is, while not ideal, good enough.
> > >
> > > So your benchmark is still "ls -lR", and your use case is still purely read-only, without writable stuff?
> > >
> > > Anyway, I'm really happy to work with you on your ostree use cases as always, as long as all the corner cases get worked out by the community.
> > > > But I know that for the people who are more interested in using composefs for containers, the eventual goal of rootless support is very important. So, on their behalf I guess the question is: is there ever any chance that something like composefs could work rootlessly? Or conversely: is there some way to get rootless support from the overlay approach? Opinions? Ideas?
> > >
> > > Honestly, I wanted to give a proper answer when Giuseppe asked me the same question. My current view is simply that this question is almost the same for all in-kernel fses with an on-disk format.
> >
> > As far as I'm concerned, filesystems with an on-disk format will not be made mountable by unprivileged containers. And I don't think I'm alone in that view.
>
> You're absolutely not alone in that view. This is even more unsafe with network and clustered filesystems, as you're trusting remote hardware that is accessible by other users than just the local host. We have had long-standing open requests to allow unprivileged users to mount arbitrary remote filesystems, and I've never seen a way to do that safely.
>
> > The idea that ever more parts of the kernel with a massive attack surface, such as filesystems, need to vouch for their safety in the face of every rando having access to unshare --mount --user --map-root is a dead end and will just end up trapping us in a never-ending cycle of security bugs. (Every single bug that's found after making that fs mountable from an unprivileged container will be treated as a security bug, whether justified or not. So this is also a good way to ruin your filesystem's reputation.)
> >
> > And honestly, if we set the precedent that it's fine for one filesystem with an on-disk format to be mountable by unprivileged containers, then other filesystems will eventually want to do this as well.
> >
> > At the rate we currently add filesystems, that's just a matter of time even if none of the existing ones wanted to do it. And then we're left arguing that this was just an exception for one super special, super safe, unexploitable filesystem with an on-disk format.
> >
> > Imho, none of this is appealing. I don't want to slowly keep building a future where we end up running fuzzers in unprivileged containers to generate random images to crash the kernel.
> >
> > I have more arguments for why I don't think this is a path we will ever go down, but I don't want that to detract from the legitimate ask of making it possible to mount trusted images from within unprivileged containers. Because I think that's perfectly legitimate.
> >
> > However, I don't think that this is something the kernel needs to solve, other than providing the necessary infrastructure so that it can be solved in userspace.
> >
> > Off-list, Amir had pointed to a blog post I wrote last week (cf. [1]) where I explained how we currently mount into the mount namespaces of unprivileged containers, which had been quite a difficult problem before the new mount API. But now it's become almost comically trivial. I mean, there's stuff that will still be good to have, but overall all the bits are already there.
> >
> > Imho, delegated mounting should be done by a system service that is responsible for all the steps that require privileges. So for most filesystems not mountable by unprivileged users this would amount to:
> >
> > fd_fs = fsopen("xfs", 0)
> > fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/sm/sm", 0)
> > fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
> > fd_mnt = fsmount(fd_fs, 0, 0)
> > // Only required for attributes that require privileges against the sb
> > // of the filesystem, such as idmapped mounts
> > mount_setattr(fd_mnt, ...)
> >
> > and then the fd_mnt can be sent to the container, which can then attach it wherever it wants to. The system-level service doesn't even need to change namespaces via setns(fd_userns|fd_mntns) like I illustrated in the post. It's sufficient to send it via, for example, an AF_UNIX socket that's exposed to the container.
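(Concretely, that hand-off could look roughly like the sketch below. The helper names are made up, the recvmsg() side in the container mirrors the sendmsg() side, and it assumes a glibc new enough (>= 2.36) to expose move_mount() and MOVE_MOUNT_F_EMPTY_PATH in <sys/mount.h>; older systems would go through syscall(2) instead:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Privileged service side: pass the detached mount fd over an AF_UNIX
 * socket using SCM_RIGHTS. */
static int send_mount_fd(int sock, int fd_mnt)
{
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr hdr;
                char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_mnt, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Container side: after receiving fd_mnt via recvmsg(), attach it at a
 * path of the container's own choosing. */
static int attach_mount_fd(int fd_mnt, const char *target)
{
        return move_mount(fd_mnt, "", AT_FDCWD, target,
                          MOVE_MOUNT_F_EMPTY_PATH);
}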
> > Of course, this system-level service would be integrated with mount(8) directly over a well-defined protocol. And this would be nestable as well, e.g. by bind-mounting the AF_UNIX socket.
> >
> > And we do already support a rudimentary form of such integration through systemd, for example via mount -t ddi (cf. [2]), which makes it possible to mount discoverable disk images (DDIs). But that's just an illustration.
> >
> > This should be integrated with mount(8) and should be a simple protocol over varlink or another lightweight IPC mechanism that can be implemented by systemd-mountd (the name I coined, for lack of imagination, when I came up with this) or by some other component if platforms like k8s really want to do their own thing.
> >
> > This also allows us to extend this feature to the whole system, btw, and to all filesystems at once. Because it means that if systemd-mountd is told which images to trust (based on location, a specific registry, a signature, or whatever), then this isn't just useful for unprivileged containers but also for regular users on the host who want to mount stuff.
> >
> > This is what we're currently working on.
>
> This is a very cool idea, and sounds like a reasonable way forward. I'd be interested to hear more about this (and in particular what sort of security model and use cases you envision for this).

I convinced Lennart to put this at the top of his todo list, so he'll hopefully finish the first implementation within the next week and put up a PR. By LSFMM we should be able to demo this.