On Sat, Mar 4, 2023, at 10:29 AM, Gao Xiang wrote:
> Hi Colin,
>
> On 2023/3/4 22:59, Colin Walters wrote:
>>
>>
>> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
>>>
>>> Actually since you're container guys, I would like to mention
>>> a way to directly reuse OCI tar data and not sure if you
>>> have some interest as well, that is just to generate EROFS
>>> metadata which could point to the tar blobs so that data itself
>>> is still the original tar, but we could add fsverity + IMMUTABLE
>>> to these blobs rather than the individual untared files.
>>
>>> - OCI layer diff IDs in the OCI spec [1] are guaranteed;
>>
>> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.
>
> Thanks for the interest and comment.
>
> I'm not aware of this project, and I'm not sure if tar-split
> helps mount tar stuffs, maybe I'm missing something?

Not directly; it's widely used in the container ecosystem
(podman/docker etc.) to split off the original bit-for-bit tar stream
metadata content from the actually large data (particularly regular
files) so that one can write the files to a regular underlying fs
(xfs/ext4/etc.) and use overlayfs on top. Then it helps reverse the
process and reconstruct the original tar stream for pushes, for exactly
the reason you mention.

Slightly OT but a whole reason we're having this conversation now is
definitely rooted in the original Docker inventor having the idea of
*deriving* or layering on top of previous images, which is not part of
dpkg/rpm or squashfs or raw disk images etc. Inherent in this is the
idea that we're not talking about *a* filesystem - we're talking about
filesystem*s* plural and how they're wired together and stacked.

It's really only very simplistic use cases for which a single read-only
filesystem suffices. They exist - e.g. people booting things like Tails
OS https://tails.boum.org/ on one of those USB sticks with a physical
write protection switch, etc.
But that approach makes every OS update very expensive - most use cases
really want fast and efficient incremental in-place OS updates and a
clear distinct split between OS filesystem and app filesystems. But
without also forcing separate size management onto both.

> Not because EROFS cannot do on-disk dedupe, just because in this
> way EROFS can only use the original tar blobs, and EROFS is not
> the guy to resolve the on-disk sharing stuff.

Right, agree; this ties into my larger point above that no one
technology/filesystem is the sole solution in the general case.

> As a kernel filesystem, if two files are equal, we could treat them
> in the same inode address space, even they are actually with slightly
> different inode metadata (uid, gid, mode, nlink, etc). That is
> entirely possible as an in-kernel filesystem even currently linux
> kernel doesn't implement finer page cache sharing, so EROFS can
> support page-cache sharing of files in all tar streams if needed.

Hmmm. I should clarify here I have zero kernel patches, I'm a userspace
developer (on container and OS updates, for which I'd like a unified
stack). But it seems to me that while you're right that it would be
technically possible for a single filesystem to do this, in practice it
would require some sort of virtual sub-filesystem internally. And at
that point, it does seem more elegant to me to make that stacking
explicit, more like how composefs is doing it.

That said I think there's a lot of legitimate debate here, and I hope
we can continue doing so productively!
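
P.S. To make the tar-split idea above a bit more concrete, here is a
rough Python sketch of the split/reassemble round trip. It is only an
illustration of the concept and not how tar-split itself (a Go project)
is implemented; the function names split_tar, reassemble and
make_demo_tar are made up for this example, and it assumes an
uncompressed tar stream with no sparse files.

```python
import io
import tarfile


def make_demo_tar() -> bytes:
    """Build a small uncompressed tar stream in memory (demo input only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        files = [("etc/hostname", b"demo\n"),
                 ("usr/bin/tool", b"\x7fELF" + b"\0" * 1000)]
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def split_tar(raw: bytes):
    """Split a tar stream into metadata segments and file payloads.

    Returns (entries, payloads). Each entry is either ("meta", bytes),
    holding headers/padding/trailer verbatim, or ("file", index), which
    refers to payloads[index]. Assumption: no sparse files, for which
    the on-disk size differs from the size recorded in TarInfo.
    """
    entries, payloads = [], []
    pos = 0
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
        for member in tf:
            if member.isreg() and member.size > 0:
                # Everything since the last payload is metadata:
                # padding of the previous file plus this file's header(s).
                entries.append(("meta", raw[pos:member.offset_data]))
                end = member.offset_data + member.size
                payloads.append(raw[member.offset_data:end])
                entries.append(("file", len(payloads) - 1))
                pos = end
    # Trailing padding and the end-of-archive zero blocks.
    entries.append(("meta", raw[pos:]))
    return entries, payloads


def reassemble(entries, payloads) -> bytes:
    """Rebuild the original tar stream bit-for-bit from the two parts."""
    out = bytearray()
    for kind, value in entries:
        out += value if kind == "meta" else payloads[value]
    return bytes(out)


if __name__ == "__main__":
    raw = make_demo_tar()
    entries, payloads = split_tar(raw)
    print(reassemble(entries, payloads) == raw)
```

In the real workflow the payloads would live as ordinary files on the
underlying fs (under overlayfs) and only the small metadata segments
would be stored on the side; because the reassembled stream is
byte-identical, the layer digest is unchanged when pushing.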