Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay

"Colin Walters" <walters@xxxxxxxxxx> · Mon, 06 Mar 2023 20:00:45 -0500

On Sat, Mar 4, 2023, at 10:29 AM, Gao Xiang wrote:
> Hi Colin,
>
> On 2023/3/4 22:59, Colin Walters wrote:
>> 
>> 
>> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
>>>
>>> Actually since you're container guys, I would like to mention
>>> a way to directly reuse OCI tar data and not sure if you
>>> have some interest as well, that is just to generate EROFS
>>> metadata which could point to the tar blobs so that data itself
>>> is still the original tar, but we could add fsverity + IMMUTABLE
>>> to these blobs rather than the individual untared files.
>> 
>>>    - OCI layer diff IDs in the OCI spec [1] are guaranteed;
>> 
>> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.
>
> Thanks for the interest and comment.
>
> I'm not aware of this project, and I'm not sure if tar-split
> helps mount tar stuffs, maybe I'm missing something?

Not directly; it's widely used in the container ecosystem (podman/docker etc.) to split off the original bit-for-bit tar stream metadata content from the actually large data (particularly regular files) so that one can write the files to a regular underlying fs (xfs/ext4/etc.) and use overlayfs on top.   Then it helps reverse the process and reconstruct the original tar stream for pushes, for exactly the reason you mention.

Slightly OT but a whole reason we're having this conversation now is definitely rooted in the original Docker inventor having the idea of *deriving* or layering on top of previous images, which is not part of dpkg/rpm or squashfs or raw disk images etc.  Inherent in this is the idea that we're not talking about *a* filesystem - we're talking about filesystem*s* plural and how they're wired together and stacked.

It's really only very simplistic use cases for which a single read-only filesystem suffices.  They exist - e.g. people booting things like Tails OS https://tails.boum.org/ on one of those USB sticks with a physical write protection switch, etc. 

But that approach makes every OS update very expensive - most use cases really want fast and efficient incremental in-place OS updates and a clear distinct split between OS filesystem and app filesystems.   But without also forcing separate size management onto both.

> Not bacause EROFS cannot do on-disk dedupe, just because in this
> way EROFS can only use the original tar blobs, and EROFS is not
> the guy to resolve the on-disk sharing stuff.  

Right, agree; this ties into my larger point above that no one technology/filesystem is the sole solution in the general case.

> As a kernel filesystem, if two files are equal, we could treat them
> in the same inode address space, even they are actually with slightly
> different inode metadata (uid, gid, mode, nlink, etc).  That is
> entirely possible as an in-kernel filesystem even currently linux
> kernel doesn't implement finer page cache sharing, so EROFS can
> support page-cache sharing of files in all tar streams if needed.

Hmmm.  I should clarify here I have zero kernel patches, I'm a userspace developer (on container and OS updates, for which I'd like a unified stack).  But it seems to me that while you're right that it would be technically possible for a single filesystem to do this, in practice it would require some sort of virtual sub-filesystem internally.  And at that point, it does seem more elegant to me to make that stacking explicit, more like how composefs is doing it.  

That said I think there's a lot of legitimate debate here, and I hope we can continue doing so productively!