Re: [LSF/MM/BPF TOPIC] Composefs vs erofs+overlay

On 2023/3/7 09:00, Colin Walters wrote:


On Sat, Mar 4, 2023, at 10:29 AM, Gao Xiang wrote:
Hi Colin,

On 2023/3/4 22:59, Colin Walters wrote:


On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:

Actually, since you're container folks, I would like to mention
a way to directly reuse OCI tar data, in case you have some
interest as well: just generate EROFS metadata which points
into the tar blobs, so that the data itself is still the
original tar, but we could add fsverity + IMMUTABLE to these
blobs rather than to the individual untarred files.

    - OCI layer diff IDs in the OCI spec [1] are guaranteed;
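
Just to illustrate, a minimal userspace sketch of adding
fsverity + IMMUTABLE to such a blob (my own illustration, with
a made-up "layer.tar" path, assuming the underlying filesystem
has the verity feature enabled):

/*
 * Hedged sketch: mark a tar blob with fs-verity plus the IMMUTABLE
 * attribute.  "layer.tar" is a hypothetical example path and the
 * filesystem is assumed to support fs-verity (e.g. ext4/f2fs/btrfs
 * with the verity feature enabled).
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>        /* FS_IOC_GETFLAGS/SETFLAGS, FS_IMMUTABLE_FL */
#include <linux/fsverity.h>  /* FS_IOC_ENABLE_VERITY */

int main(void)
{
	int flags = 0;
	int fd = open("layer.tar", O_RDONLY);

	if (fd < 0)
		return 1;

	/* Enable fs-verity so reads of the blob are integrity-checked. */
	struct fsverity_enable_arg arg = {
		.version = 1,
		.hash_algorithm = FS_VERITY_HASH_ALG_SHA256,
		.block_size = 4096,
	};
	if (ioctl(fd, FS_IOC_ENABLE_VERITY, &arg) < 0)
		perror("FS_IOC_ENABLE_VERITY");

	/* Also set the immutable bit so the blob cannot be changed. */
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
		flags |= FS_IMMUTABLE_FL;
		if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
			perror("FS_IOC_SETFLAGS");
	}
	close(fd);
	return 0;
}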

The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.

Thanks for the interest and comment.

I wasn't aware of this project, and I'm not sure whether
tar-split helps mount tar data; maybe I'm missing something?

Not directly; it's widely used in the container ecosystem (podman/docker etc.) to split off the original bit-for-bit tar stream metadata from the actual bulk data (particularly regular files), so that one can write the files to a regular underlying fs (xfs/ext4/etc.) and use overlayfs on top.  Then it helps reverse the process and reconstruct the original tar stream for pushes, for exactly the reason you mention.

Slightly OT, but a big reason we're having this conversation now is definitely rooted in the original Docker inventor having the idea of *deriving* or layering on top of previous images, which is not part of dpkg/rpm or squashfs or raw disk images etc.  Inherent in this is the idea that we're not talking about *a* filesystem - we're talking about filesystem*s* plural and how they're wired together and stacked.

Yes, as you said, the actual OCI standard (or Docker, whatever)
is all about layering.  There could be a possibility to
directly use the original layers for mounting without any
conversion (like "untar", or converting to another blob format
which could support 4k reflink dedupe).

I believe it can save untar time and avoid the snapshot gc
problems that users are concerned about, for example in our
cloud with thousands of containers launching/running/gcing at
the same time.


It's really only very simplistic use cases for which a single read-only filesystem suffices.  They exist - e.g. people booting things like Tails OS https://tails.boum.org/ on one of those USB sticks with a physical write protection switch, etc.

I cannot access that website.  If you consider physical write
protection, then a read-only filesystem written on physical
media is needed, so the EROFS manifest can land on raw disks
(for write protection and hardware integrity checks) or on
other local filesystems.  It depends on the actual detailed
requirements.


But that approach makes every OS update very expensive - most use cases really want fast and efficient incremental in-place OS updates and a clear, distinct split between the OS filesystem and app filesystems, but without also forcing separate size management onto both.

Not because EROFS cannot do on-disk dedupe, just because in this
way EROFS can only use the original tar blobs, so EROFS is not
the one to resolve the on-disk sharing.

Right, agree; this ties into my larger point above that no one technology/filesystem is the sole solution in the general case.

Anyway, if you consider an _untar_ way, you could also
consider a conversion way (like you said, padding to 4k).

Since the OCI standard is all about layering, you could pad to
4k and then do data dedupe with:
  - the data blobs themselves (some recent projects like
    Nydus with EROFS);
  - reflink-enabled filesystems (such as XFS or btrfs), as in
    the sketch below.
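
For the reflink case, a rough sketch of my own (the blob names
and offsets are made-up examples): on XFS/btrfs an identical
4KiB-aligned chunk can be shared between two blobs with the
FICLONERANGE ioctl instead of being copied:

/*
 * Hedged sketch: reflink one 4KiB-aligned chunk from an existing blob
 * into a new blob on a reflink-capable filesystem (XFS/btrfs).  The
 * file names and offsets are hypothetical examples.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FICLONERANGE, struct file_clone_range */

int main(void)
{
	int src = open("existing-blob", O_RDONLY);
	int dst = open("new-blob", O_WRONLY | O_CREAT, 0644);

	if (src < 0 || dst < 0)
		return 1;

	/*
	 * Share the first 4KiB instead of copying it; offsets and length
	 * must be filesystem-block (here 4KiB) aligned.
	 */
	struct file_clone_range fcr = {
		.src_fd = src,
		.src_offset = 0,
		.src_length = 4096,
		.dest_offset = 0,
	};
	if (ioctl(dst, FICLONERANGE, &fcr) < 0)
		perror("FICLONERANGE");

	close(src);
	close(dst);
	return 0;
}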

That's because untar behaves almost the same as the conversion
way, except that the conversion way doesn't produce massive
numbers of files/dirs on the underlying filesystem and then gc
those massive files/dirs again.

To be clear, since you are the original OSTree author, I'm not
promoting alternative ways for you here.  I believe all
practical engineering projects have advantages and
disadvantages.  For example, even git is moving toward using a
packed object store more and more, and I guess OSTree could
also have some packed format for effective distribution, at
least to some extent.

Here I just would like to say that the on-disk EROFS format
(like other widely used kernel filesystems) is not designed
only for specific use cases like OSTree, tar blobs or whatever,
or for specific media like block-based, file-based, etc.

As far as I can see, EROFS+overlay has already supported the
OSTree composefs-like use cases for two years and has landed in
many distros.  And other local kernel filesystems don't behave
all that well with the "ls -lR" workload.
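
For reference, a minimal sketch of the plain EROFS + overlayfs
stacking meant here (just a read-only EROFS lower layer plus a
writable overlay, not the full composefs-like setup; the paths
and the loop device are made-up examples):

/*
 * Hedged sketch: mount an EROFS image read-only and stack a writable
 * overlayfs on top of it.  All paths and the loop device are
 * hypothetical examples.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* The EROFS image (already attached to /dev/loop0) is the
	 * read-only lower layer. */
	if (mount("/dev/loop0", "/mnt/erofs", "erofs", MS_RDONLY, NULL) < 0)
		perror("mount erofs");

	/* overlayfs merges it with a writable upper directory. */
	if (mount("overlay", "/mnt/merged", "overlay", 0,
		  "lowerdir=/mnt/erofs,upperdir=/upper,workdir=/work") < 0)
		perror("mount overlay");

	return 0;
}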


As a kernel filesystem, if two files are equal, we could back
them with the same inode address space, even if they actually
have slightly different inode metadata (uid, gid, mode, nlink,
etc.).  That is entirely possible for an in-kernel filesystem
even though the Linux kernel currently doesn't implement
finer-grained page cache sharing, so EROFS can support
page-cache sharing of files across all tar streams if needed.

Hmmm.  I should clarify here I have zero kernel patches, I'm a userspace developer (on container and OS updates, for which I'd like a unified stack).  But it seems to me that while you're right that it would be technically possible for a single filesystem to do this, in practice it would require some sort of virtual sub-filesystem internally.  And at that point, it does seem more elegant to me to make that stacking explicit, more like how composefs is doing it.

As you said, you're a userspace developer, so here I just need
to clarify that internal inodes are very common among local
fses; to my knowledge, btrfs and f2fs, in addition to EROFS,
all have such things in order to make use of the kernel page
cache.

One advantage over the stackable way is this: with the
stackable way, you have to explicitly open the backing file,
which takes more time to look up the dcache/icache and even the
on-disk hierarchy.  By contrast, if you consider page cache
sharing of the original tar blobs, you don't need to do another
open at all.  Sure, it's not measured by an "ls -lR" benchmark,
but it does impact end users.

Again, I'm trying to say that I'm not in favor of or against
any user-space distribution solution, like OSTree or anything
else.  Nydus is just one of the userspace examples using EROFS,
an adaptation which I persuaded them to do.  Besides, EROFS has
already landed on all mainstream in-market Android smartphones,
and I hope it can get more attention and adoption across
various use cases, and that more developers will join us.


That said I think there's a lot of legitimate debate here, and I hope we can continue doing so productively!

Thanks.  As a kernel filesystem developer for many years, I
hope our (or at least my own) design can be used more widely.
So again, I'm not against your OSTree design, and I believe all
concrete distribution approaches have pros and cons.

Thanks,
Gao Xiang