On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
>
> Hello,
>
> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> Composefs filesystem. It is an opportunistically sharing, validating
> image-based filesystem, targeting usecases like validated ostree
> rootfs:es, validated container images that share common files, as well
> as other image based usecases.
>
> During the discussions in the composefs proposal (as seen on LWN[3])
> it has been proposed that (with some changes to overlayfs) similar
> behaviour can be achieved by combining the overlayfs
> "overlay.redirect" xattr with a read-only filesystem such as erofs.
>
> There are pros and cons to both these approaches, and the discussion
> about their respective value has sometimes been heated. We would like
> to have an in-person discussion at the summit, ideally also involving
> more of the filesystem development community, so that we can reach
> some consensus on what is the best approach.

In order to better understand the behaviour and requirements of the
overlayfs+erofs approach, I spent some time implementing direct support
for erofs in libcomposefs. So, with the current HEAD of
github.com/containers/composefs you can now do:

$ mkcomposefs --digest-store=objects --format=erofs source-dir image.erofs

This will produce an object store with the backing files, and an erofs
file with the required overlayfs xattrs, including a made-up one called
"overlay.fs-verity" containing the expected fs-verity digest for the
lower dir. It also adds the required whiteouts to cover the 00-ff dirs
from the lower dir.

These erofs files are ordered similarly to the composefs files, and we
give similar guarantees about their reproducibility, etc. So, they
should be apples-to-apples comparable with the composefs images.

Given this, I ran another set of performance tests on the original cs9
rootfs dataset, again measuring the time of `ls -lR`. I also tried to
measure the memory use like this:

# echo 3 > /proc/sys/vm/drop_caches
# systemd-run --scope sh -c 'ls -lR mountpoint > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'

These are the alternatives I tried:

 xfs:       the source of the image, a regular dir on xfs
 erofs:     the image.erofs above, on loopback
 erofs dio: the image.erofs above, on loopback with --direct-io=on
 ovl:       the erofs above combined with overlayfs
 ovl dio:   the erofs dio above combined with overlayfs
 cfs:       a composefs mount of image.cfs

All tests use the same objects dir, stored on xfs. The erofs and
overlayfs implementations are from a stock 6.1.13 kernel, and the
composefs module is from github HEAD. I tried loopback both with and
without the direct-io option, because without direct-io enabled the
kernel will double-cache the loopbacked data, as per[1].
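For reference, the mounts for the alternatives above were set up
roughly like this. The paths are illustrative and the exact overlayfs
option spelling may not match my test script verbatim, so treat it as
a sketch:

Loopback-mount the erofs image (with --direct-io=on added to losetup
for the dio cases):

# losetup -f --show image.erofs
/dev/loop0
# mount -t erofs -o ro /dev/loop0 /mnt/lower

The overlayfs xattrs that mkcomposefs generated can then be inspected
in the mounted image:

# getfattr -d -m overlay /mnt/lower/usr/bin/bash

Then a read-only overlayfs is stacked on top, with the objects dir as
an extra lower layer (the whiteouts in the image hide its 00-ff dirs,
and the metacopy/redirect xattrs make the regular files resolve to
their backing objects):

# mount -t overlay overlay -o ro,metacopy=on,redirect_dir=follow,lowerdir=/mnt/lower:objects /mnt/ovl

Composefs instead mounts the image directly, with no loopback device
involved:

# mount -t composefs image.cfs -o basedir=objects /mnt/cfs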
The produced images are:

 8.9M image.cfs
11.3M image.erofs

And gives these results:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |   (mb)
-----------+------------+------------+---------
xfs        |       1449 |        442 |      54
erofs      |        700 |        391 |      45
erofs dio  |        939 |        400 |      45
ovl        |       1827 |        530 |     130
ovl dio    |       2156 |        531 |     130
cfs        |        689 |        389 |      51

I also ran the same tests in a VM that had the latest kernel, including
the lazyfollow patches ("ovl lazy" in the table, not using direct-io),
this time ext4-based:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |   (mb)
-----------+------------+------------+---------
ext4       |       1135 |        394 |      54
erofs      |        715 |        401 |      46
erofs dio  |        922 |        401 |      45
ovl        |       1412 |        515 |     148
ovl dio    |       1810 |        532 |     149
ovl lazy   |       1063 |        523 |      87
cfs        |        719 |        463 |      51

Things noticeable in the results:

* Composefs and erofs (by itself) perform roughly similarly. This is
  not necessarily news, and the results from Jingbo Xu match this.

* Erofs on top of direct-io enabled loopback causes quite a drop in
  performance, which I don't really understand. Especially since it's
  reporting the same memory use as non-direct-io. I guess the
  double-caching in the latter case isn't properly attributed to the
  cgroup, so the difference is not measured. However, why would the
  double cache improve performance? Maybe I'm not completely
  understanding how these things interact.

* Stacking overlayfs on top of erofs makes the warm-cache times about
  100 msec slower than all the non-overlay approaches, and the
  cold-cache case is much slower still. The cold-cache performance is
  helped significantly by the lazyfollow patches, but the warm-cache
  overhead remains.

* The use of overlayfs more than doubles memory use, probably because
  of all the extra inodes and dentries in action for the various
  layers. The lazyfollow patches help, but only partially.

* Even though overlayfs+erofs is slower than cfs and raw erofs, it is
  not that much slower (~25%) than the pure xfs/ext4 directory, which
  is a pretty good baseline for comparisons. It is even faster when
  using lazyfollow on ext4.

* The erofs images are slightly larger than the equivalent composefs
  images.

In summary: the performance of composefs is somewhat better than the
best erofs+overlayfs combination, although the overlay approach is not
significantly worse than the baseline of a regular directory, except
that it uses a bit more memory.

On top of the above purely performance-based comparisons, I would like
to re-state some of the other advantages of composefs compared to the
overlay approach:

* Composefs is namespaceable, in the sense that you can use it (given
  mount capabilities) inside a namespace (such as a container) without
  access to non-namespaced resources like loopback or device-mapper
  devices. (There was work on fixing this with loopfs, but that seems
  to have stalled.)

* While it is not in the current design, the simplicity of the format
  and the lack of loopback make it at least theoretically possible
  that composefs can be made usable in a rootless fashion at some
  point in the future.

And of course, there are disadvantages to composefs too. Primarily it
is more code, increasing the maintenance burden and the risk of
security problems. Composefs is particularly burdensome because it is
a stacking filesystem, and these have historically been shown to be
hard to get right.

The question now is: what is the best approach overall? For my own
primary usecase of making a verifying ostree root filesystem, the
overlay approach (with the lazyfollow work finished) is, while not
ideal, good enough.
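To make the verification side of that usecase concrete: the idea is
that everything chains from a single trusted digest. Roughly (the
digest= mount option name is from the current composefs patches, and I
am sketching from memory here, so don't take the exact syntax as
gospel):

# fsverity enable image.cfs
# fsverity measure image.cfs
sha256:5fbfe5... image.cfs

The objects in the digest store likewise get fs-verity enabled. At
boot you then mount the image, passing the expected digest obtained
from a trusted source (e.g. the signed ostree commit), and composefs
validates the image against it, as well as each backing file against
the digests recorded in the image:

# mount -t composefs image.cfs -o basedir=objects,digest=5fbfe5... /mnt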
But I know that for the people who are more interested in using
composefs for containers, the eventual goal of rootless support is
very important. So, on behalf of them I guess the question is: Is
there ever any chance that something like composefs could work
rootlessly? Or conversely: Is there some way to get rootless support
from the overlay approach?

Opinions? Ideas?

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                            Red Hat, Inc
       alexl@xxxxxxxxxx        alexander.larsson@xxxxxxxxx