On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote:
> On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote:
> > Christian Brauner <brauner@xxxxxxxxxx> writes:
> > 2) no multi repo support:
> >
> > Both reflinks and hardlinks do not work across mount points, so we
>
> Just fwiw, afaict reflinks work across mount points since at least 5.18.

They might work for NFS server *file clones* across different exports
within the same NFS server (or server cluster), but they most certainly
don't work across mountpoints for local filesystems, or across different
types of filesystems.

I'm not here to advocate composefs as the right solution, I'm just
pointing out that the proposed alternatives do not, in any way, have the
same critical behavioural characteristics that composefs provides to
container orchestration systems, and hence do not solve the problems
that composefs is attempting to solve.

In short: any solution that requires userspace to create a new
filesystem hierarchy one file at a time via standard syscall mechanisms
is not going to perform acceptably at scale - that's a major problem
that composefs addresses.

The whole problem with file copying to create images - even with reflink
or hardlinks avoiding data copying - is the overhead of creating and
destroying those copies in the first place. A reflink copy of tens of
thousands of files in a complex directory structure is not free - each
individual reflink has a time, CPU, memory and IO cost to it. The
teardown cost is similar - the only way to remove the "container image"
built with reflinks is "rm -rf", and that has significant time, CPU,
memory and IO costs associated with it as well.

Further, you can't ship container images to remote hosts using reflink
copies - they can only be created at runtime on the host that the
container will be instantiated on. IOWs, the entire cost of reflink
copies for container instances must be taken at container instantiation
and destruction time.
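
To make the per-file cost concrete, here is a rough userspace sketch of
what building an image out of reflinks involves. It is only an
illustration, not how any particular container runtime does it: the
helper names clone_file() and clone_tree() are made up for this example,
the fallback to a plain data copy is an assumption so that the sketch
also runs on filesystems without reflink support, and symlinks, special
files and metadata are ignored. The point is simply that every directory
and every file costs at least one syscall round trip and a metadata
modification.

```python
import fcntl
import os
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int)
FICLONE = 0x40049409


def clone_file(src, dst):
    """Reflink src to dst; fall back to a data copy if cloning fails.

    Hypothetical helper for illustration. Returns True if reflinked.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            pass  # no reflink support here (or cross-filesystem clone)
    shutil.copyfile(src, dst)
    return False


def clone_tree(src_root, dst_root):
    """Rebuild src_root under dst_root one object at a time.

    Returns (dirs_created, files_created, files_reflinked). The work
    done is proportional to the number of objects in the tree, even
    when no file data is copied at all.
    """
    dirs = files = reflinked = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target, exist_ok=True)   # one mkdir per directory
        dirs += 1
        for name in filenames:
            # one open/create + one clone ioctl (or copy) per file
            if clone_file(os.path.join(dirpath, name),
                          os.path.join(target, name)):
                reflinked += 1
            files += 1
    return dirs, files, reflinked
```

Teardown is the mirror image: something equivalent to shutil.rmtree()
has to visit and unlink every one of those objects again.
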
When you have container instances that might only be needed for a few
seconds, taking half a minute to set up the container instance and then
another half a minute to tear it down just isn't viable - we need
instantiation and teardown times in the order of a second or two.

From my reading of the code, composefs is based around the concept of a
verifiable "shipping manifest", where the filesystem namespace presented
to users by the kernel is derived from the manifest rather than from
some other filesystem namespace. Overlay, reflinks, etc all use some
other filesystem namespace to generate the container namespace that
links to the common data, whilst composefs uses the manifest for that.

The use of a manifest file means there is almost zero container setup
overhead - ship the manifest file, mount it, all done - and zero
teardown overhead, as unmounting the filesystem is all that is needed to
remove all traces of the container instance from the system.

In having a custom manifest format, the manifest can easily contain
verification information alongside the pointer to the content the
namespace should expose. i.e. the manifest references a secure content
addressed repository that is protected by fsverity and contains the
fsverity digests itself. Hence it doesn't rely on the repository to
self-verify, it ensures that the repository files actually contain the
data the manifest expects them to contain.

Hence if the composefs kernel module is provided with a mechanism for
validating the chain of trust for the manifest file that a user is
trying to mount, then we just don't care who the mounting user is. This
architecture is a viable path to rootless mounting of pre-built third
party container images.

Also, with the host's content addressed repository being managed
separately by the trusted host and distro package management, the
manifest is not unique to a single container host.
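
As a toy model of the manifest idea described above, the sketch below
maps paths to digests of objects held in a content addressed directory
and checks each object against the digest recorded in the manifest when
it is read. This is not the composefs on-disk format or its kernel
interface: plain SHA-256 over the whole file stands in for fsverity
digests, and store_object(), build_manifest() and read_via_manifest()
are names invented for this example.

```python
import hashlib
import os


def digest(data):
    """Stand-in for an fsverity digest: SHA-256 of the whole file."""
    return hashlib.sha256(data).hexdigest()


def store_object(repo, data):
    """Add data to the content addressed repo; return its digest."""
    name = digest(data)
    path = os.path.join(repo, name)
    if not os.path.exists(path):      # identical content is stored once
        with open(path, "wb") as f:
            f.write(data)
    return name


def build_manifest(repo, tree):
    """tree maps path -> bytes. Returns a manifest of path -> digest."""
    return {path: store_object(repo, data) for path, data in tree.items()}


def read_via_manifest(repo, manifest, path):
    """Resolve path through the manifest and verify the backing object.

    The check uses the digest recorded in the manifest, so a modified
    repository object is detected even though the repo itself is not
    asked to vouch for its contents.
    """
    expected = manifest[path]
    with open(os.path.join(repo, expected), "rb") as f:
        data = f.read()
    if digest(data) != expected:
        raise ValueError("object for %s does not match manifest" % path)
    return data
```

In this model "instantiating" an image is just loading the manifest, two
manifests that reference the same content share a single object in the
repository, and a manifest built on one host is equally valid on any
other host holding the same objects.
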
The distro can build manifests so that containers are running known,
signed and verified container images built by the distro. The container
orchestration software or admin could also build manifests on demand and
sign them.

If the manifest is not signed, not signed with a key loaded into the
kernel keyring, or does not pass verification, then we simply fall back
to root-in-the-init-ns permissions being required to mount the manifest.
This fallback is exactly the same security model we have for every other
type of filesystem image that the Linux kernel can mount - we trust root
not to be mounting malicious images.

Essentially, I don't think any of the filesystems in the Linux kernel
currently provide a viable solution to the problem that composefs is
trying to solve. We need a different way of solving the ephemeral
container namespace creation and destruction overhead problem. Composefs
provides a mechanism that not only solves this problem, and potentially
several others, but is also easy to retrofit into existing production
container stacks.

As such, I think composefs is definitely worth further time and
investment as a unique line of filesystem development for Linux. Solve
the chain of trust problem (i.e. crypto signing for the manifest files)
and we potentially have game changing container infrastructure in a
couple of thousand lines of code...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx