Re: [LSF/MM/BPF TOPIC] Composefs vs erofs+overlay

On 2023/3/6 19:33, Alexander Larsson wrote:
On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@xxxxxxxxxx> wrote:

On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@xxxxxxxxxx> wrote:

Hello,

Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
Composefs filesystem. It is an opportunistically sharing, validating,
image-based filesystem, targeting use cases like validated ostree root
filesystems and validated container images that share common files, as
well as other image-based use cases.

During the discussions of the composefs proposal (as seen on LWN[3])
it has been proposed that (with some changes to overlayfs) similar
behaviour can be achieved by combining the overlayfs
"overlay.redirect" xattr with a read-only filesystem such as erofs.

There are pros and cons to both of these approaches, and the
discussion about their respective value has sometimes been heated. We
would like to have an in-person discussion at the summit, ideally also
involving more of the filesystem development community, so that we can
reach some consensus on the best approach.

In order to better understand the behaviour and requirements of the
overlayfs+erofs approach I spent some time implementing direct support
for erofs in libcomposefs. So, with current HEAD of
github.com/containers/composefs you can now do:

$ mkcomposefs --digest-store=objects --format=erofs source-dir image.erofs

This will produce an object store with the backing files, and an
erofs file with the required overlayfs xattrs, including a made-up one
called "overlay.fs-verity" containing the expected fs-verity digest of
the corresponding backing file in the lower dir. It also adds the
required whiteouts to cover the 00-ff dirs from the lower dir.
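
To be concrete, such an image is meant to be consumed by mounting the
erofs and stacking overlayfs on top, with the object store as an extra
lower layer that the redirect xattrs resolve into. An illustrative,
untested sketch (paths are examples, and the exact option set is my
assumption):

# mount -t erofs -o loop image.erofs /mnt/erofs
# mount -t overlay overlay -o metacopy=on,redirect_dir=follow \
    -o lowerdir=/mnt/erofs:objects /mnt/image

The whiteouts mentioned above are what keep the 00-ff object dirs of
that second lower layer from showing up in the merged tree.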

These erofs files are ordered similarly to the composefs files, and we
give similar guarantees about their reproducibility, etc. So, they
should be apples-to-apples comparable with the composefs images.

Given this, I ran another set of performance tests on the original cs9
rootfs dataset, again measuring the time of `ls -lR`. I also tried to
measure the memory use like this:

# echo 3 > /proc/sys/vm/drop_caches
# systemd-run --scope sh -c 'ls -lR mountpoint > /dev/null; cat
$(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'

These are the alternatives I tried:

xfs: the source of the image, regular dir on xfs
erofs: the image.erofs above, on loopback
erofs dio: the image.erofs above, on loopback with --direct-io=on
ovl: erofs above combined with overlayfs
ovl dio: erofs dio above combined with overlayfs
cfs: composefs mount of image.cfs
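
For the dio variants, the loop device setup looks something like this
(device name is an example, and the composefs mount syntax is my
recollection of the patch-set and may differ):

# losetup --direct-io=on --find --show image.erofs
/dev/loop0
# mount -t erofs /dev/loop0 /mnt/erofs
# mount -t composefs image.cfs -o basedir=objects /mnt/cfs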

All tests use the same objects dir, stored on xfs. The erofs and
overlayfs implementations are from a stock 6.1.13 kernel, and the
composefs module is from github HEAD.

I tried loopback both with and without the direct-io option, because
without direct-io enabled the kernel will double-cache the loopback
data, as per[1].

The produced images are:
  8.9M image.cfs
11.3M image.erofs

And they give these results:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (MB)
-----------+------------+------------+---------
xfs        |   1449     |    442     |    54
erofs      |    700     |    391     |    45
erofs dio  |    939     |    400     |    45
ovl        |   1827     |    530     |   130
ovl dio    |   2156     |    531     |   130
cfs        |    689     |    389     |    51

It has been noted that the readahead done by kernel_read() may pull
unrelated data into memory, which skews the results in favour of
workloads that consume all the filesystem metadata (such as the ls -lR
use case of the above test). In the table above this favours composefs
(which uses kernel_read in some codepaths) as well as non-dio erofs
(the non-dio loopback device uses readahead too).
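
A rough way to observe this effect (illustrative commands, assuming
util-linux fincore is available): drop the caches, run the metadata
workload, then check how much of the backing image ended up in the
page cache:

# echo 3 > /proc/sys/vm/drop_caches
# ls -lR mountpoint > /dev/null
# fincore image.erofs

If a metadata-only walk leaves a large fraction of the image resident,
readahead pulled in data beyond what the workload itself needed.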

I updated composefs to not use kernel_read here:
   https://github.com/containers/composefs/pull/105

And a new kernel patch-set based on this is available at:
   https://github.com/alexlarsson/linux/tree/composefs

The resulting table is now (dropping the non-dio erofs):

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (MB)
-----------+------------+------------+---------
xfs        |   1449     |    442     |    54
erofs dio  |    939     |    400     |    45
ovl dio    |   2156     |    531     |   130
cfs        |    833     |    398     |    51

And here is the same set of tests with everything stored on ext4
instead of xfs, adding an overlayfs run with lazy lowerdata lookup
(ovl lazy):

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (MB)
-----------+------------+------------+---------
ext4       |   1135     |    394     |    54
erofs dio  |    922     |    401     |    45
ovl dio    |   1810     |    532     |   149
ovl lazy   |   1063     |    523     |    87
cfs        |    768     |    459     |    51

So, while cfs is somewhat worse now for this particular use case, my
overall analysis still stands.

We will investigate it later. Also, you might still need to test some
random workloads other than "ls -lR" (such as stat()ing ~1000 files
randomly [1]) rather than completely ignoring my and Jingbo's
comments, or at least explain why "ls -lR" is the only judgement on
your side.
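
(As a concrete sketch of such a workload, assuming GNU shuf/xargs and
a pre-generated file list, something like:

# find mountpoint -type f | shuf -n 1000 > /tmp/files
# echo 3 > /proc/sys/vm/drop_caches
# time xargs -a /tmp/files stat > /dev/null

i.e. cold-cache stat()s of a random subset instead of an in-order walk
of all the metadata.)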

My point is simple. If you see a chance to end up with an improved
EROFS in some respects, we do hope to improve your "ls -lR" numbers as
much as possible without bad impacts on random access. But if you'd
like to upstream a new file-based stackable filesystem for this
ostree-specific use case, whatever your KPIs, then I don't think we
can reach a conclusion here, and I cannot be of any help to you, since
I'm not the one making that call.

After all, you're addressing a very specific workload ("ls -lR"), and
EROFS, as well as EROFS + overlayfs, doesn't perform badly compared
with Composefs even without further tuning, and even though EROFS
doesn't directly use file-based interfaces.

Thanks,
Gao Xiang

[1] https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@xxxxxxxxxxxxxxxxx




