On Tue, Mar 07, 2023 at 08:38:58AM -0500, Jeff Layton wrote:
> On Tue, 2023-03-07 at 11:15 +0100, Christian Brauner wrote:
> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> > > Hi Alexander,
> > >
> > > On 2023/3/3 21:57, Alexander Larsson wrote:
> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > Recently Giuseppe Scrivano and I have worked on [1] and proposed [2] the Composefs filesystem. It is an opportunistically sharing, validating, image-based filesystem, targeting use cases like validated ostree rootfses, validated container images that share common files, as well as other image-based use cases.
> > > > >
> > > > > During the discussions of the composefs proposal (as seen on LWN [3]) it has been proposed that (with some changes to overlayfs) similar behaviour can be achieved by combining the overlayfs "overlay.redirect" xattr with a read-only filesystem such as erofs.
> > > > >
> > > > > There are pros and cons to both of these approaches, and the discussion about their respective value has sometimes been heated. We would like to have an in-person discussion at the summit, ideally also involving more of the filesystem development community, so that we can reach some consensus on what is the best approach.
> > > >
> > > > In order to better understand the behaviour and requirements of the overlayfs+erofs approach I spent some time implementing direct support for erofs in libcomposefs. So, with the current HEAD of github.com/containers/composefs you can now do:
> > > >
> > > > $ mkcomposefs --digest-store=objects --format=erofs source-dir image.erofs
> > >
> > > Thank you for taking the time to work on EROFS support. I don't have time to play with it yet, since I'd like to get erofs-utils 1.6 out these days and will then work on some new stuff such as !pagesize block sizes, as I said previously.
> > >
> > > > This will produce an object store with the backing files, and an erofs file with the required overlayfs xattrs, including a made-up one called "overlay.fs-verity" containing the expected fs-verity digest for the lower dir. It also adds the required whiteouts to cover the 00-ff dirs from the lower dir.
> > > >
> > > > These erofs files are ordered similarly to the composefs files, and we give similar guarantees about their reproducibility, etc. So, they should be apples-to-apples comparable with the composefs images.
> > > >
> > > > Given this, I ran another set of performance tests on the original cs9 rootfs dataset, again measuring the time of `ls -lR`. I also tried to measure the memory use like this:
> > > >
> > > > # echo 3 > /proc/sys/vm/drop_caches
> > > > # systemd-run --scope sh -c 'ls -lR mountpoint > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> > > >
> > > > These are the alternatives I tried:
> > > >
> > > > xfs:       the source of the image, regular dir on xfs
> > > > erofs:     the image.erofs above, on loopback
> > > > erofs dio: the image.erofs above, on loopback with --direct-io=on
> > > > ovl:       erofs above combined with overlayfs
> > > > ovl dio:   erofs dio above combined with overlayfs
> > > > cfs:       composefs mount of image.cfs
> > > >
> > > > All tests use the same objects dir, stored on xfs. The erofs and overlay implementations are from a stock 6.1.13 kernel, and the composefs module is from the GitHub HEAD.
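(For reference, the "ovl"/"ovl dio" stacking above presumably boils down to a read-only overlayfs mount whose lowerdir chain is the loop-mounted erofs image followed by the shared objects dir. Below is a minimal sketch of that in C; the paths and the exact overlayfs options are illustrative guesses rather than the actual test setup, and the dio variants would additionally enable direct I/O on the loop device via losetup --direct-io=on / the LOOP_SET_DIRECT_IO ioctl.)

/*
 * Illustrative only: stack a read-only overlayfs on top of an erofs
 * image that is already loop-mounted at /mnt/erofs, with the backing
 * object store at /objects on the host filesystem.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        const char *opts = "redirect_dir=on,metacopy=on,"
                           "lowerdir=/mnt/erofs:/objects";

        if (mount("overlay", "/mnt/ovl", "overlay", MS_RDONLY, opts) < 0) {
                perror("mount overlay");
                return 1;
        }
        return 0;
}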
> > > > I tried loopback both with and without the direct-io option, because without direct-io enabled the kernel will double-cache the loopbacked data, as per [1].
> > > >
> > > > The produced images are:
> > > >
> > > >  8.9M image.cfs
> > > > 11.3M image.erofs
> > > >
> > > > And the results are:
> > > >
> > > >            | Cold cache | Warm cache | Mem use
> > > >            | (msec)     | (msec)     | (MB)
> > > > -----------+------------+------------+---------
> > > > xfs        |       1449 |        442 |      54
> > > > erofs      |        700 |        391 |      45
> > > > erofs dio  |        939 |        400 |      45
> > > > ovl        |       1827 |        530 |     130
> > > > ovl dio    |       2156 |        531 |     130
> > > > cfs        |        689 |        389 |      51
> > > >
> > > > I also ran the same tests in a VM that had the latest kernel including the lazyfollow patches ("ovl lazy" in the table, not using direct-io), this one ext4 based:
> > > >
> > > >            | Cold cache | Warm cache | Mem use
> > > >            | (msec)     | (msec)     | (MB)
> > > > -----------+------------+------------+---------
> > > > ext4       |       1135 |        394 |      54
> > > > erofs      |        715 |        401 |      46
> > > > erofs dio  |        922 |        401 |      45
> > > > ovl        |       1412 |        515 |     148
> > > > ovl dio    |       1810 |        532 |     149
> > > > ovl lazy   |       1063 |        523 |      87
> > > > cfs        |        719 |        463 |      51
> > > >
> > > > Things noticeable in the results:
> > > >
> > > > * composefs and erofs (by itself) perform roughly the same. This is not necessarily news, and results from Jingbo Xu match this.
> > > >
> > > > * Erofs on top of a direct-io enabled loopback device causes quite a drop in performance, which I don't really understand, especially since it reports the same memory use as without direct-io. I guess the double-caching in the latter case isn't properly attributed to the cgroup, so the difference is not measured. However, why would the double cache improve performance? Maybe I'm not completely understanding how these things interact.
> > >
> > > We've already analysed this: the root cause is that composefs uses kernel_read() to read its metadata, so irrelevant metadata (such as dir data) is read in together. Such heuristic readahead is unusual for local fses (obviously almost all in-kernel filesystems don't use kernel_read() to read their metadata; although some filesystems may readahead related extent metadata when reading an inode, they at least do _not_ work like kernel_read()). But double caching will introduce almost the same impact as kernel_read() (assuming you have read some of the loop device source code).
> > >
> > > I do hope you have already read Jingbo's latest test results, which show how badly readahead performs if fs metadata is only partially, randomly used (stat < 1500 files):
> > > https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@xxxxxxxxxxxxxxxxx
> > >
> > > Also, you could explicitly _disable_ readahead for the composefs manifest file (since all EROFS metadata reads are done without readahead) and see how it performs then.
> > >
> > > Again, if your workload is just "ls -lR", my answer is "just async readahead the whole manifest file / loop device" when mounting. That will give the best result for you. But I'm not sure that is the real use case you propose.
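(The "just async readahead the whole manifest file / loop device when mounting" suggestion is easy to prototype from userspace. A minimal sketch, reusing the image.erofs name from above; whether this helps workloads that only touch part of the metadata is exactly the open question:)

/*
 * Minimal sketch: ask the kernel to pull the whole image into the page
 * cache before mounting it, so that later metadata lookups are cache
 * hits.  POSIX_FADV_WILLNEED initiates nonblocking readahead; len == 0
 * means "to the end of the file".
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("image.erofs", O_RDONLY);
        int ret;

        if (fd < 0) {
                perror("image.erofs");
                return 1;
        }
        ret = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        if (ret)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));
        close(fd);
        return ret ? 1 : 0;
}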
> > > > * Stacking overlayfs on top of erofs adds about 100 msec to the warm-cache times compared to all non-overlay approaches, and much more in the cold-cache case. The cold-cache performance is helped significantly by the lazyfollow patches, but the warm-cache overhead remains.
> > > >
> > > > * The use of overlayfs more than doubles memory use, probably because of all the extra inodes and dentries in action for the various layers. The lazyfollow patches help, but only partially.
> > > >
> > > > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is not that much slower (~25%) than the pure xfs/ext4 directory, which is a pretty good baseline for comparisons. It is even faster when using lazyfollow on ext4.
> > > >
> > > > * The erofs images are slightly larger than the equivalent composefs image.
> > > >
> > > > In summary: the performance of composefs is somewhat better than the best erofs+ovl combination, although the overlay approach is not significantly worse than the baseline of a regular directory, except that it uses a bit more memory.
> > > >
> > > > On top of the above purely performance-based comparisons, I would like to re-state some of the other advantages of composefs compared to the overlay approach:
> > > >
> > > > * composefs is namespaceable, in the sense that you can use it (given mount capabilities) inside a namespace (such as a container) without access to non-namespaced resources like loopback or device-mapper devices. (There was work on fixing this with loopfs, but that seems to have stalled.)
> > > >
> > > > * While it is not in the current design, the simplicity of the format and the lack of loopback make it at least theoretically possible that composefs could be made usable in a rootless fashion at some point in the future.
> > >
> > > Have you considered sending some commands to /dev/cachefiles to configure a daemonless dir and mounting the erofs image directly using "erofs over fscache", but in a daemonless way? That is ongoing work on our side.
> > >
> > > IMHO, file-based interfaces are not all that charming. Historically the practice has been to "avoid directly reading files in the kernel", which I think is why almost all local fses don't work on files directly and loopback devices are the way these use cases are handled. If loopback devices are not okay for you, how about improving loopback devices instead? That would benefit almost all local fses.
> > >
> > > > And of course, there are disadvantages to composefs too, primarily that it is more code, increasing the maintenance burden and the risk of security problems. Composefs is particularly burdensome because it is a stacking filesystem, and these have historically been shown to be hard to get right.
> > > >
> > > > The question now is: what is the best approach overall? For my own primary use case of making a verifying ostree root filesystem, the overlay approach (with the lazyfollow work finished) is, while not ideal, good enough.
> > >
> > > So your benchmark is still "ls -lR", and your use case is still purely read-only, without writable stuff?
> > >
> > > Anyway, I'm really happy to work with you on your ostree use cases as always, as long as all the corner cases get worked out by the community.
> > > > But I know that for the people who are more interested in using composefs for containers, the eventual goal of rootless support is very important. So, on their behalf I guess the question is: is there ever any chance that something like composefs could work rootlessly? Or conversely: is there some way to get rootless support from the overlay approach? Opinions? Ideas?
> > >
> > > Honestly, I wanted to give a proper answer when Giuseppe asked me the same question. My current view is simply that this question is almost the same for all in-kernel fses with an on-disk format.
> >
> > As far as I'm concerned, filesystems with an on-disk format will not be made mountable by unprivileged containers. And I don't think I'm alone in that view.
>
> You're absolutely not alone in that view. This is even more unsafe with network and clustered filesystems, as you're trusting remote hardware that is accessible by other users than just the local host. We have had long-standing open requests to allow unprivileged users to mount arbitrary remote filesystems, and I've never seen a way to do that safely.
>
> > The idea that ever more parts of the kernel with a massive attack surface, such as filesystems, need to vouch for their safety in the face of every rando having access to unshare --mount --user --map-root is a dead end and will just end up trapping us in a never-ending cycle of security bugs. (Every single bug that's found after making that fs mountable from an unprivileged container will be treated as a security bug, whether justified or not. So this is also a good way to ruin your filesystem's reputation.)
> >
> > And honestly, if we set the precedent that it's fine for one filesystem with an on-disk format to be mountable by unprivileged containers, then other filesystems will eventually want to do this as well.
> >
> > At the rate we currently add filesystems, that's just a matter of time even if none of the existing ones wanted to do it. And then we're left arguing that this was just an exception for one super special, super safe, unexploitable filesystem with an on-disk format.
> >
> > Imho, none of this is appealing. I don't want to slowly keep building a future where we end up running fuzzers in unprivileged containers to generate random images to crash the kernel.
> >
> > I have more arguments for why I don't think this is a path we will ever go down, but I don't want that to detract from the legitimate ask of making it possible to mount trusted images from within unprivileged containers. Because I think that's perfectly legitimate.
> >
> > However, I don't think that this is something the kernel needs to solve, other than providing the necessary infrastructure so that it can be solved in userspace.
> >
> > Off-list, Amir had pointed to a blog post I wrote last week (cf. [1]) where I explained how we currently mount into the mount namespaces of unprivileged containers, which had been quite a difficult problem before the new mount API. But now it's become almost comically trivial. I mean, there's stuff that will still be good to have, but overall all the bits are already there.
> >
> > Imho, delegated mounting should be done by a system service that is responsible for all the steps that require privileges. So for most filesystems not mountable by unprivileged users this would amount to:
> >
> > fd_fs = fsopen("xfs", 0)
> > fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/sm/sm", 0)
> > fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
> > fd_mnt = fsmount(fd_fs, 0, 0)
> > // Only required for attributes that require privileges against the sb
> > // of the filesystem, such as idmapped mounts
> > mount_setattr(fd_mnt, ...)
> >
> > and then the fd_mnt can be sent to the container, which can then attach it wherever it wants to. The system-level service doesn't even need to change namespaces via setns(fd_userns|fd_mntns) like I illustrated in the post. It's sufficient to send it via, for example, an AF_UNIX socket that's exposed to the container.
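(Concretely, that hand-off could look roughly like the sketch below. The helper names are made up, the recvmsg() side in the container mirrors the sendmsg() side, and it assumes a glibc new enough (>= 2.36) to expose move_mount() and MOVE_MOUNT_F_EMPTY_PATH in <sys/mount.h>; older systems would go through syscall(2) instead:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Privileged service side: pass the detached mount fd over an AF_UNIX
 * socket using SCM_RIGHTS. */
static int send_mount_fd(int sock, int fd_mnt)
{
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr hdr;
                char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_mnt, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Container side: after receiving fd_mnt via recvmsg(), attach it at a
 * path of the container's own choosing. */
static int attach_mount_fd(int fd_mnt, const char *target)
{
        return move_mount(fd_mnt, "", AT_FDCWD, target,
                          MOVE_MOUNT_F_EMPTY_PATH);
}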
> > Of course, this system-level service would be integrated with mount(8) directly over a well-defined protocol. And this would be nestable as well, e.g. by bind-mounting the AF_UNIX socket.
> >
> > And we do already support a rudimentary form of such integration through systemd, for example via mount -t ddi (cf. [2]), which makes it possible to mount discoverable disk images (DDIs). But that's just an illustration.
> >
> > This should be integrated with mount(8) and should be a simple protocol over varlink or another lightweight IPC mechanism that can be implemented by systemd-mountd (the name I coined, for lack of imagination, when I came up with this) or by some other component if platforms like k8s really want to do their own thing.
> >
> > This also allows us to extend this feature to the whole system, btw, and to all filesystems at once. Because it means that if systemd-mountd is told which images to trust (based on location, a specific registry, a signature, or whatever), then this isn't just useful for unprivileged containers but also for regular users on the host who want to mount stuff.
> >
> > This is what we're currently working on.
>
> This is a very cool idea, and sounds like a reasonable way forward. I'd be interested to hear more about this (and in particular what sort of security model and use cases you envision for this).

I convinced Lennart to put this at the top of his todo list, so he'll hopefully finish the first implementation within the next week and put up a PR. By LSFMM we should be able to demo this.