Re: [LSF/MM/BPF TOPIC] Composefs vs erofs+overlay

Hi Christian,

On 2023/3/7 18:15, Christian Brauner wrote:
On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
Hi Alexander,

On 2023/3/3 21:57, Alexander Larsson wrote:
On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@xxxxxxxxxx> wrote:

Hello,

Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
Composefs filesystem. It is an opportunistically sharing, validating,
image-based filesystem targeting use cases like validated ostree root
filesystems and validated container images that share common files, as
well as other image-based use cases.

During the discussion of the composefs proposal (as seen on LWN[3])
it has been proposed that, with some changes to overlayfs, similar
behaviour can be achieved by combining the overlayfs
"overlay.redirect" xattr with a read-only filesystem such as erofs.

There are pros and cons to both these approaches, and the discussion
about their respective value has sometimes been heated. We would like
to have an in-person discussion at the summit, ideally also involving
more of the filesystem development community, so that we can reach
some consensus on what is the best approach.

In order to better understand the behaviour and requirements of the
overlayfs+erofs approach I spent some time implementing direct support
for erofs in libcomposefs. So, with current HEAD of
github.com/containers/composefs you can now do:

$ mkcompose --digest-store=objects --format=erofs source-dir image.erofs

Thank you for taking the time to work on EROFS support.  I don't have
time to play with it yet since I'd like to finish erofs-utils 1.6
these days and will work on some new features such as !pagesize block
size, as I said previously.


This will produce an object store with the backing files, and an erofs
file with the required overlayfs xattrs, including a made-up one
called "overlay.fs-verity" containing the expected fs-verity digest
for the lower dir. It also adds the required whiteouts to cover the
00-ff dirs from the lower dir.

These erofs files are ordered similarly to the composefs files, and we
give similar guarantees about their reproducibility, etc. So, they
should be apples-to-apples comparable with the composefs images.

Given this, I ran another set of performance tests on the original cs9
rootfs dataset, again measuring the time of `ls -lR`. I also tried to
measure the memory use like this:

# echo 3 > /proc/sys/vm/drop_caches
# systemd-run --scope sh -c 'ls -lR mountpoint > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'

These are the alternatives I tried:

xfs: the source of the image, regular dir on xfs
erofs: the image.erofs above, on loopback
erofs dio: the image.erofs above, on loopback with --direct-io=on
ovl: erofs above combined with overlayfs
ovl dio: erofs dio above combined with overlayfs
cfs: composefs mount of image.cfs

All tests use the same objects dir, stored on xfs. The erofs and
overlayfs implementations are from a stock 6.1.13 kernel, and the
composefs module is from GitHub HEAD.

I tried loopback both with and without the direct-io option, because
without direct-io enabled the kernel will double-cache the loopback
data, as per [1].
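
For reference, this is roughly what the --direct-io=on option boils
down to at the loop ioctl level; a minimal sketch only, with the image
path as a placeholder and error handling omitted:

#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(void)
{
	/* Pick a free loop device, attach the (read-only) erofs image,
	 * then ask the loop driver to use O_DIRECT on the backing file
	 * so its contents are not cached twice.  LOOP_SET_DIRECT_IO can
	 * fail if the loop block size doesn't suit the backing fs. */
	int ctrl = open("/dev/loop-control", O_RDWR);
	int devnr = ioctl(ctrl, LOOP_CTL_GET_FREE);
	char path[64];

	snprintf(path, sizeof(path), "/dev/loop%d", devnr);

	int loopfd = open(path, O_RDWR);
	int backing = open("image.erofs", O_RDONLY);

	ioctl(loopfd, LOOP_SET_FD, backing);
	ioctl(loopfd, LOOP_SET_DIRECT_IO, 1UL);

	printf("now: mount -t erofs %s /mnt\n", path);
	return 0;
}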

The produced images are:
   8.9M image.cfs
11.3M image.erofs

The tests give these results:
           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (MB)
-----------+------------+------------+---------
xfs        |   1449     |    442     |    54
erofs      |    700     |    391     |    45
erofs dio  |    939     |    400     |    45
ovl        |   1827     |    530     |   130
ovl dio    |   2156     |    531     |   130
cfs        |    689     |    389     |    51

I also ran the same tests in a VM that had the latest kernel including
the lazyfollow patches (ovl lazy in the table, not using direct-io),
this one ext4-based:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (MB)
-----------+------------+------------+---------
ext4       |   1135     |    394     |    54
erofs      |    715     |    401     |    46
erofs dio  |    922     |    401     |    45
ovl        |   1412     |    515     |   148
ovl dio    |   1810     |    532     |   149
ovl lazy   |   1063     |    523     |    87
cfs        |    719     |    463     |    51

Things noticeable in the results:

* composefs and erofs (by itself) perform roughly  similar. This is
    not necessarily news, and results from Jingbo Xu match this.

* Erofs on top of direct-io-enabled loopback causes quite a drop in
    performance, which I don't really understand, especially since it's
    reporting the same memory use as non-direct-io. I guess the
    double-caching in the latter case isn't properly attributed to the
    cgroup, so the difference is not measured. However, why would the
    double cache improve performance?  Maybe I'm not completely
    understanding how these things interact.

We've already analysed this: the root cause is that composefs uses
kernel_read() to read its manifest, so irrelevant metadata (such as
directory data) gets read in along the way.  Such heuristic readahead
is unusual for local filesystems (almost no in-kernel filesystem uses
kernel_read() to read its metadata; although some filesystems do read
ahead related extent metadata when reading an inode, they at least do
_not_ behave like kernel_read()).  But double caching introduces
almost the same impact as kernel_read() (assuming you have read some
of the loop device source code).

I do hope you have already read Jingbo's latest test results, which
show how badly readahead performs if fs metadata is only partially and
randomly used (stat < 1500 files):
https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@xxxxxxxxxxxxxxxxx

Also, you could explicitly _disable_ readahead for the composefs
manifest file (because all EROFS metadata reads are done without
readahead) and see how it performs then.
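
(For illustration only, a hypothetical kernel-side sketch of that idea,
assuming composefs keeps a struct file for the manifest it later feeds
to kernel_read(); the helper name and path argument are made up:)

#include <linux/err.h>
#include <linux/fcntl.h>
#include <linux/fs.h>

/* Hypothetical helper, not actual composefs code: zeroing the per-file
 * readahead window means later kernel_read() calls on this descriptor
 * no longer trigger heuristic readahead, similar to how EROFS reads
 * its own metadata without readahead. */
static struct file *cfs_open_manifest_noreadahead(const char *path)
{
	struct file *f = filp_open(path, O_RDONLY, 0);

	if (!IS_ERR(f))
		f->f_ra.ra_pages = 0;
	return f;
}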

Again, if your workload is just "ls -lR", my answer is "just
asynchronously read ahead the whole manifest file / loop device" at
mount time.  That will give you the best result.  But I'm not sure
that is the real use case you are proposing.
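
(And a userspace sketch of that "read ahead the whole image at mount
time" idea; the path is just a placeholder.  POSIX_FADV_WILLNEED
initiates a nonblocking read of the range into the page cache:)

#include <fcntl.h>
#include <unistd.h>

static void prefetch_image(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return;
	/* offset 0, len 0 == "to the end of the file" */
	posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
	close(fd);	/* the cached pages outlive this fd */
}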


* Stacking overlayfs on top of erofs makes warm-cache times about
    100 msec slower than all the non-overlay approaches, and much more
    in the cold-cache case. The cold-cache performance is helped
    significantly by the lazyfollow patches, but the warm-cache
    overhead remains.

* The use of overlayfs more than doubles memory use, probably
    because of all the extra inodes and dentries in action for the
    various layers. The lazyfollow patches help, but only partially.

* Even though overlayfs+erofs is slower than cfs and raw erofs, it is
    not that much slower (~25%) than the pure xfs/ext4 directory, which
    is a pretty good baseline for comparisons. It is even faster when
    using lazyfollow on ext4.

* The erofs images are slightly larger than the equivalent composefs
    image.

In summary: The performance of composefs is somewhat better than the
best erofs+ovl combination, although the overlay approach is not
significantly worse than the baseline of a regular directory, except
that it uses a bit more memory.

On top of the above pure performance based comparisons I would like to
re-state some of the other advantages of composefs compared to the
overlay approach:

* composefs is namespaceable, in the sense that you can use it (given
    mount capabilities) inside a namespace (such as a container) without
    access to non-namespaced resources like loopback or device-mapper
    devices. (There was work on fixing this with loopfs, but that seems
    to have stalled.)

* While it is not in the current design, the simplicity of the format
    and lack of loopback makes it at least theoretically possible that
    composefs can be made usable in a rootless fashion at some point in
    the future.

Would you consider sending some commands to /dev/cachefiles to
configure a daemonless dir and mounting the erofs image directly using
"erofs over fscache", but in a daemonless way?  That is ongoing work
on our side.

IMHO, I don't think file-based interfaces are particularly attractive.
Historically, I recall the practice has been to "avoid directly reading
files in the kernel", which is why almost all local filesystems don't
work on files directly and loopback devices are the usual way to handle
such use cases.  If loopback devices are not okay for you, how about
improving loopback devices?  That would benefit almost all local
filesystems.


And of course, there are disadvantages to composefs too, primarily
that it is more code, increasing the maintenance burden and the risk
of security problems. Composefs is particularly burdensome because it
is a stacking filesystem, and these have historically been shown to be
hard to get right.


The question now is what is the best approach overall? For my own
primary use case of making a verifying ostree root filesystem, the
overlay approach (with the lazyfollow work finished) is, while not
ideal, good enough.

So your evaluation is still based on "ls -lR", and your use case is
still purely read-only, without any writable layer?

Anyway, I'm really happy to work with you on your ostree use cases
as always, as long as all the corner cases are worked out by the
community.


But I know that for the people who are more interested in using
composefs for containers, the eventual goal of rootless support is
very important. So, on their behalf, I guess the question is: Is there
ever any chance that something like composefs could work rootlessly?
Or conversely: Is there some way to get rootless support from the
overlay approach? Opinions? Ideas?

Honestly, I also wanted a proper answer when Giuseppe asked me the
same question.  My current view is simply that the question is almost
the same for all in-kernel filesystems with an on-disk format.

As far as I'm concerned, filesystems with an on-disk format will not be
made mountable by unprivileged containers. And I don't think I'm alone
in that view. The idea that ever more parts of the kernel with a
massive attack surface, such as a filesystem, need to vouch for their
safety in the face of every rando having access to
unshare --mount --user --map-root is a dead end and will just end up
trapping us in a neverending cycle of security bugs. (Every single bug
that's found after making that fs mountable from an unprivileged
container will be treated as a security bug, no matter whether that's
justified or not. So this is also a good way to ruin your filesystem's
reputation.)

And honestly, if we set the precedent that it's fine for one filesystem
with an on-disk format to be mountable by unprivileged containers, then
other filesystems will eventually want to do this as well.

At the rate we currently add filesystems, that's just a matter of time,
even if none of the existing ones would want to do it as well. And then
we're left arguing that this was just an exception for one super
special, super safe, unexploitable filesystem with an on-disk format.

Yes, +1.  That's partly why I didn't answer immediately: I'd like to
find a chance to get more people interested in EROFS, and I hoped it
could be (somewhat) pointed out by other filesystem folks at that time.


Imho, none of this is appealing. I don't want to slowly keep building a
future where we end up running fuzzers in unprivileged containers to
generate random images to crash the kernel.

Even fuzzers can't guarantee this unless we completely freeze the fs
code; otherwise any useful improvement will in principle need much
deeper and longer fuzzing.  Honestly, I'm not sure such fuzzing could
even keep up with the release schedule, let alone make the code
bug-free.


I have more arguments for why I don't think this is a path we will ever go down,
but I don't want this to detract from the legitimate ask of making it
possible to mount trusted images from within unprivileged containers.
Because I think that's perfectly legitimate.

However, I don't think that this is something the kernel needs to solve
other than providing the necessary infrastructure so that this can be
solved in userspace.

Yes, I think that's the right principle, as long as we have a way to do
things effectively in userspace.


Off-list, Amir had pointed to a blog post I wrote last week (cf. [1])
where I explained how we currently mount into the mount namespaces of
unprivileged containers, which had been quite a difficult problem
before the new mount API. But now it's become almost comically trivial.
I mean, there's stuff that will still be good to have, but overall all
the bits are already there.

Imho, delegated mounting should be done by a system service that is
responsible for all the steps that require privileges. So for most
filesystems not mountable by unprivileged users this would amount to:

fd_fs = fsopen("xfs")
fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
fsconfig(FSCONFIG_CMD_CREATE)
fd_mnt = fsmount(fd_fs)
// Only required for attributes that require privileges against the sb
// of the filesystem such as idmapped mounts
mount_setattr(fd_mnt, ...)

and then the fd_mnt can be sent to the container, which can then attach
it wherever it wants to. The system-level service doesn't even need to
change namespaces via setns(fd_userns|fd_mntns) like I illustrated in
the post. It's sufficient if we send it via, for example, an AF_UNIX
socket that's exposed to the container.
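
To make that concrete, here is a rough sketch of both sides, assuming
the glibc >= 2.36 wrappers for the new mount API in <sys/mount.h>; the
socket is an already-connected AF_UNIX socket, the xfs source path is
the placeholder from above, and error handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Privileged service: build a detached mount, ship it via SCM_RIGHTS. */
static void send_mount(int sock)
{
	int fd_fs = fsopen("xfs", FSOPEN_CLOEXEC);

	fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/sm/sm", 0);
	fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

	int fd_mnt = fsmount(fd_fs, FSMOUNT_CLOEXEC, 0);

	char data = 'm';
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_mnt, sizeof(int));
	sendmsg(sock, &msg, 0);
}

/* In the container: receive fd_mnt and attach it wherever wanted. */
static void attach_mount(int sock, const char *target)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	int fd_mnt;

	recvmsg(sock, &msg, 0);
	memcpy(&fd_mnt, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));

	move_mount(fd_mnt, "", AT_FDCWD, target, MOVE_MOUNT_F_EMPTY_PATH);
}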

Of course, this system-level service would be integrated with mount(8)
directly over a well-defined protocol. And this would be nestable as
well by, e.g., bind-mounting the AF_UNIX socket.

And we do already support a rudimentary form of such integration
through systemd, for example via mount -t ddi (cf. [2]), which makes it
possible to mount discoverable disk images (DDIs). But that's just an
illustration.

This should be integrated with mount(8) and should be a simple protocol
over varlink or another lightweight IPC mechanism that can be
implemented by systemd-mountd (the name I coined for this, for lack of
imagination) or by some other component if platforms like k8s really
want to do their own thing.

This also allows us to extend this feature to the whole system btw and
to all filesystems at once. Because it means that if systemd-mountd is
told what images to trust (based on location, from a specific registry,
signature, or whatever) then this isn't just useful for unprivileged
containers but also for regular users on the host that want to mount
stuff.

This is what we're currently working on.

(There's stuff that we can do to make this more powerful __if__ we need
to. One example would probably be that we _could_ make it possible to
mark a superblock as being owned by a specific namespace, with similar
permission checks to what we currently do for idmapped mounts
(privileged in the superblock of the fs, privileged over the ns to
delegate to, etc). IOW,

fd_fs = fsopen("xfs")
fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns)

which completely sidesteps the issue of making that on-disk filesystem
mountable by unpriv users.

But let me say that this is completely unnecessary today as you can do:

fd_fs = fsopen("xfs")
fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
fsconfig(FSCONFIG_CMD_CREATE)
fd_mnt = fsmount(fd_fs)
mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP)

which changes ownership across the whole filesystem. The only time you
really want what I mention here is if you want to delegate control over
__every single ioctl and potentially destructive operation associated
with that filesystem__ to an unprivileged container which is almost
never what you want.)
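
(A minimal sketch of that last sequence with the full syscall
signatures, again assuming the glibc >= 2.36 wrappers; the user
namespace fd would come from the container, e.g. by opening its
/proc/<pid>/ns/user:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mount.h>

static int idmapped_xfs_mount(int fd_userns)
{
	int fd_fs = fsopen("xfs", FSOPEN_CLOEXEC);

	fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/sm/sm", 0);
	fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

	int fd_mnt = fsmount(fd_fs, FSMOUNT_CLOEXEC, 0);

	/* Map ownership in this mount according to the container's userns. */
	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_IDMAP,
		.userns_fd = fd_userns,
	};
	mount_setattr(fd_mnt, "", AT_EMPTY_PATH, &attr, sizeof(attr));

	return fd_mnt;	/* hand this to the container as shown above */
}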

Good to know.  I do hope it can be resolved in userspace as you said.
So is there any barrier to doing it this way, so that we don't have to
bother with FS_USERNS_MOUNT for filesystems with an on-disk format?
Your delegated-control approach is good stuff, at least on my side,
and we hope some system-wide service can help with this since our
cloud might need it in the future as well.

Thanks,
Gao Xiang


[1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
[2]: https://github.com/systemd/systemd/pull/26695


