On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
>
> Hello,
>
> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> Composefs filesystem. It is an opportunistically sharing, validating
> image-based filesystem, targeting usecases like validated ostree
> rootfs:es, validated container images that share common files, as well
> as other image based usecases.
>
> During the discussions in the composefs proposal (as seen on LWN[3])
> it has been proposed that (with some changes to overlayfs) similar
> behaviour can be achieved by combining the overlayfs
> "overlay.redirect" xattr with a read-only filesystem such as erofs.
>
> There are pros and cons to both these approaches, and the discussion
> about their respective value has sometimes been heated. We would like
> to have an in-person discussion at the summit, ideally also involving
> more of the filesystem development community, so that we can reach
> some consensus on what is the best approach.

In order to better understand the behaviour and requirements of the
overlayfs+erofs approach, I spent some time implementing direct support
for erofs in libcomposefs. So, with the current HEAD of
github.com/containers/composefs you can now do:

$ mkcomposefs --digest-store=objects --format=erofs source-dir image.erofs

This will produce an object store with the backing files, and an erofs
file with the required overlayfs xattrs, including a made-up one called
"overlay.fs-verity" containing the expected fs-verity digest for the
lower dir. It also adds the required whiteouts to cover the 00-ff dirs
from the lower dir.

These erofs files are ordered similarly to the composefs files, and we
give similar guarantees about their reproducibility, etc. So, they
should be apples-to-apples comparable with the composefs images.

Given this, I ran another set of performance tests on the original cs9
rootfs dataset, again measuring the time of `ls -lR`. I also tried to
measure the memory use like this:

# echo 3 > /proc/sys/vm/drop_caches
# systemd-run --scope sh -c 'ls -lR mountpoint > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'

These are the alternatives I tried:

 xfs:       the source of the image, a regular dir on xfs
 erofs:     the image.erofs above, on loopback
 erofs dio: the image.erofs above, on loopback with --direct-io=on
 ovl:       the erofs above combined with overlayfs
 ovl dio:   the erofs dio above combined with overlayfs
 cfs:       a composefs mount of image.cfs

All tests use the same objects dir, stored on xfs. The erofs and
overlayfs implementations are from a stock 6.1.13 kernel, and the
composefs module is from github HEAD. I tried loopback both with and
without the direct-io option, because without direct-io enabled the
kernel will double-cache the loopbacked data, as per[1].
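For reference, the mounts for the alternatives above were set up
roughly like this. The paths are illustrative and the exact overlayfs
option spelling may not match my test script verbatim, so treat it as
a sketch:

Loopback-mount the erofs image (with --direct-io=on added to losetup
for the dio cases):

# losetup -f --show image.erofs
/dev/loop0
# mount -t erofs -o ro /dev/loop0 /mnt/lower

The overlayfs xattrs that mkcomposefs generated can then be inspected
in the mounted image:

# getfattr -d -m overlay /mnt/lower/usr/bin/bash

Then a read-only overlayfs is stacked on top, with the objects dir as
an extra lower layer (the whiteouts in the image hide its 00-ff dirs,
and the metacopy/redirect xattrs make the regular files resolve to
their backing objects):

# mount -t overlay overlay -o ro,metacopy=on,redirect_dir=follow,lowerdir=/mnt/lower:objects /mnt/ovl

Composefs instead mounts the image directly, with no loopback device
involved:

# mount -t composefs image.cfs -o basedir=objects /mnt/cfs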
The produced images are:

 8.9M image.cfs
11.3M image.erofs

And gives these results:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |   (mb)
-----------+------------+------------+---------
xfs        |       1449 |        442 |      54
erofs      |        700 |        391 |      45
erofs dio  |        939 |        400 |      45
ovl        |       1827 |        530 |     130
ovl dio    |       2156 |        531 |     130
cfs        |        689 |        389 |      51

I also ran the same tests in a VM that had the latest kernel, including
the lazyfollow patches ("ovl lazy" in the table, not using direct-io),
this time ext4-based:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |   (mb)
-----------+------------+------------+---------
ext4       |       1135 |        394 |      54
erofs      |        715 |        401 |      46
erofs dio  |        922 |        401 |      45
ovl        |       1412 |        515 |     148
ovl dio    |       1810 |        532 |     149
ovl lazy   |       1063 |        523 |      87
cfs        |        719 |        463 |      51

Things noticeable in the results:

* Composefs and erofs (by itself) perform roughly similarly. This is
  not necessarily news, and the results from Jingbo Xu match this.

* Erofs on top of direct-io enabled loopback causes quite a drop in
  performance, which I don't really understand. Especially since it's
  reporting the same memory use as non-direct-io. I guess the
  double-caching in the latter case isn't properly attributed to the
  cgroup, so the difference is not measured. However, why would the
  double cache improve performance? Maybe I'm not completely
  understanding how these things interact.

* Stacking overlayfs on top of erofs makes the warm-cache times about
  100 msec slower than all the non-overlay approaches, and the
  cold-cache case is much slower still. The cold-cache performance is
  helped significantly by the lazyfollow patches, but the warm-cache
  overhead remains.

* The use of overlayfs more than doubles memory use, probably because
  of all the extra inodes and dentries in action for the various
  layers. The lazyfollow patches help, but only partially.

* Even though overlayfs+erofs is slower than cfs and raw erofs, it is
  not that much slower (~25%) than the pure xfs/ext4 directory, which
  is a pretty good baseline for comparisons. It is even faster when
  using lazyfollow on ext4.

* The erofs images are slightly larger than the equivalent composefs
  images.

In summary: the performance of composefs is somewhat better than the
best erofs+overlayfs combination, although the overlay approach is not
significantly worse than the baseline of a regular directory, except
that it uses a bit more memory.

On top of the above purely performance-based comparisons, I would like
to re-state some of the other advantages of composefs compared to the
overlay approach:

* Composefs is namespaceable, in the sense that you can use it (given
  mount capabilities) inside a namespace (such as a container) without
  access to non-namespaced resources like loopback or device-mapper
  devices. (There was work on fixing this with loopfs, but that seems
  to have stalled.)

* While it is not in the current design, the simplicity of the format
  and the lack of loopback make it at least theoretically possible
  that composefs can be made usable in a rootless fashion at some
  point in the future.

And of course, there are disadvantages to composefs too. Primarily it
is more code, increasing the maintenance burden and the risk of
security problems. Composefs is particularly burdensome because it is
a stacking filesystem, and these have historically been shown to be
hard to get right.

The question now is: what is the best approach overall? For my own
primary usecase of making a verifying ostree root filesystem, the
overlay approach (with the lazyfollow work finished) is, while not
ideal, good enough.
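To make the verification side of that usecase concrete: the idea is
that everything chains from a single trusted digest. Roughly (the
digest= mount option name is from the current composefs patches, and I
am sketching from memory here, so don't take the exact syntax as
gospel):

# fsverity enable image.cfs
# fsverity measure image.cfs
sha256:5fbfe5... image.cfs

The objects in the digest store likewise get fs-verity enabled. At
boot you then mount the image, passing the expected digest obtained
from a trusted source (e.g. the signed ostree commit), and composefs
validates the image against it, as well as each backing file against
the digests recorded in the image:

# mount -t composefs image.cfs -o basedir=objects,digest=5fbfe5... /mnt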
But I know that for the people who are more interested in using
composefs for containers, the eventual goal of rootless support is
very important. So, on behalf of them I guess the question is: Is
there ever any chance that something like composefs could work
rootlessly? Or conversely: Is there some way to get rootless support
from the overlay approach?

Opinions? Ideas?

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                            Red Hat, Inc
       alexl@xxxxxxxxxx        alexander.larsson@xxxxxxxxx