Giuseppe Scrivano <gscrivan@xxxxxxxxxx> writes:

> Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:
>
>> On 2023/1/22 06:34, Giuseppe Scrivano wrote:
>>> Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:
>>>
>>>> On 2023/1/22 00:19, Giuseppe Scrivano wrote:
>>>>> Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:
>>>>>
>>>>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>>>>>>> Hi Amir,
>>>>>>>
>>>>>>> Amir Goldstein <amir73il@xxxxxxxxx> writes:
>>>>>>>
>>>>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Alexander,
>>>>>>>>
>>>>>>>> I must say that I am a little bit puzzled by this v3.
>>>>>>>> Gao, Christian and myself asked you questions on v2
>>>>>>>> that are not mentioned in v3 at all.
>>>>>>>>
>>>>>>>> To sum it up, please do not propose composefs without explaining
>>>>>>>> what are the barriers for achieving the exact same outcome with
>>>>>>>> the use of a read-only overlayfs with two lower layers -
>>>>>>>> the uppermost with erofs containing the metadata files, which include
>>>>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>>>>>> to the lowermost layer containing the content files.
>>>>>>>
>>>>>>> I think Dave explained quite well why using overlay is not comparable
>>>>>>> to what composefs does.
>>>>>>>
>>>>>>> One big difference is that overlay still requires at least a syscall
>>>>>>> for each file in the image, and then we need the equivalent of
>>>>>>> "rm -rf" to clean it up.  That is somewhat acceptable for long-running
>>>>>>> services, but it is not for "serverless" containers where
>>>>>>> images/containers are created and destroyed frequently.  So even in
>>>>>>> the case where we already have all the image files available locally,
>>>>>>> we still need to create a checkout with the final structure we need
>>>>>>> for the image.
>>>>>>>
>>>>>>> I also don't see how overlay would solve the verified image problem.
>>>>>>> We would have the same problem we have today with fs-verity, as it
>>>>>>> can only validate a single file but not the entire directory
>>>>>>> structure.  Changes that affect the layer containing the
>>>>>>> trusted.overlay.{metacopy,redirect} xattrs won't be noticed.
>>>>>>>
>>>>>>> There are at the moment two ways to handle container images, both
>>>>>>> somewhat guided by the available file systems in the kernel:
>>>>>>>
>>>>>>> - A single image mounted as a block device.
>>>>>>> - A list of tarballs (OCI image) that are unpacked and mounted as
>>>>>>>   overlay layers.
>>>>>>>
>>>>>>> One big advantage of the block device model is that you can use
>>>>>>> dm-verity; this is something we miss today with OCI container images
>>>>>>> that use overlay.
>>>>>>>
>>>>>>> What we are proposing with composefs is a way to have "dm-verity"
>>>>>>> style validation based on fs-verity and the possibility to share
>>>>>>> individual files instead of layers.  These files can also be on
>>>>>>> different file systems, which is something not possible with the
>>>>>>> block device model.
>>>>>>
>>>>>> That is honestly not a new idea, including the chain of trust.  Even
>>>>>> the out-of-tree incremental fs has lately been using fs-verity for
>>>>>> this as well, except that it does so in a truly self-contained way.
>>>>>>
>>>>>>> The composefs manifest blob could be generated remotely and signed.
>>>>>>> A client would need just to validate the signature for the manifest
>>>>>>> blob and from there retrieve the files that are not in the local CAS
>>>>>>> (even from an insecure source) and mount directly the manifest file.
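
[ For illustration only: a rough sketch of the flow described above.  The
  fsverity-utils commands are real; the mkcomposefs invocation and the
  mount options are assumptions rather than the exact composefs UAPI. ]

    # build the manifest (descriptor) from a directory tree
    $ mkcomposefs rootfs/ image.cfs

    # sign the manifest and seal it with fs-verity (fsverity-utils)
    $ fsverity sign image.cfs image.cfs.sig --key=key.pem --cert=cert.pem
    $ fsverity enable image.cfs --signature=image.cfs.sig

    # mount it; backing files are resolved by digest in the local CAS
    # (the basedir= form follows the example quoted later in this mail;
    #  how the manifest itself is passed differs between patch revisions)
    $ mount -t composefs image.cfs -o basedir=/path/to/store /mnt/cfs
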
>>>>>>
>>>>>>
>>>>>> Back to the topic: after thinking about it, I have to add some
>>>>>> background for reference.
>>>>>>
>>>>>> First, EROFS had the same internal discussion and made its decision
>>>>>> almost _two years ago_ (June 2021).  It means:
>>>>>>
>>>>>>   a) Some internal people really suggested EROFS could develop an
>>>>>>      entire new file-based in-kernel local cache subsystem (what you
>>>>>>      call a local CAS, whatever) with a stackable file interface, so
>>>>>>      that the existing Nydus image service [1] (like ostree, and
>>>>>>      maybe ostree can use it as well) doesn't need to modify anything
>>>>>>      to use existing blobs;
>>>>>>
>>>>>>   b) Reuse the existing fscache/cachefiles.
>>>>>>
>>>>>> The reason why we (especially me) finally selected b) is:
>>>>>>
>>>>>>  - see the discussion of Google's original Incremental FS topic
>>>>>>    [2] [3] in 2019, as Amir already mentioned.  At that time all fs
>>>>>>    folks really preferred to reuse an existing subsystem for
>>>>>>    in-kernel caching rather than reinvent another new in-kernel
>>>>>>    wheel for a local cache.
>>>>>>
>>>>>>    [ Reinventing a new wheel is not hard (fs or caching), it just
>>>>>>      makes Linux more fragmented.  Especially when a new filesystem
>>>>>>      is proposed just to generate images full of massive numbers of
>>>>>>      new magical symlinks with *overridden* uid/gid/permissions to
>>>>>>      replace regular files. ]
>>>>>>
>>>>>>  - an in-kernel cache implementation usually meets several common
>>>>>>    potential security issues; reusing an existing subsystem lets all
>>>>>>    fses address them and benefit from it.
>>>>>>
>>>>>>  - Usually an existing widely-used userspace implementation is never
>>>>>>    an excuse for a new in-kernel feature.
>>>>>>
>>>>>> Although David Howells has been quite busy these months developing
>>>>>> the new netfs interface; otherwise (we think) we would already
>>>>>> support failover, multiple daemons/dirs, daemonless mode and more.
>>>>>
>>>>> We have not added any new cache system.  Overlay does "layer
>>>>> deduplication" and in a similar way composefs does "file
>>>>> deduplication".  That is not a built-in feature, it is just a side
>>>>> effect of how things are packed together.
>>>>>
>>>>> Using fscache seems like a good idea and it has many advantages, but
>>>>> it is a centralized cache mechanism and it looks like a potential
>>>>> problem when you think about allowing mounts from a user namespace.
>>>>
>>>> I think Christian [1] had the same feeling as mine at that time:
>>>>
>>>> "I'm pretty skeptical of this plan whether we should add more filesystems
>>>> that are mountable by unprivileged users. FUSE and Overlayfs are
>>>> adventurous enough and they don't have their own on-disk format. The
>>>> track record of bugs exploitable due to userns isn't making this
>>>> very attractive."
>>>>
>>>> Yes, you could add fs-verity, but EROFS could add fs-verity (or just
>>>> use dm-verity) as well; it doesn't change _anything_ about the
>>>> concerns of "allowing mounts from a user namespace".
>>>
>>> I've mentioned that as a potential feature we could add in the future,
>>> given the simplicity of the format and that it uses a CAS for its data
>>> instead of fscache.  Each user can have and use their own store to
>>> mount the images.
>>>
>>> At this point it is just a wish from userspace, as it would improve a
>>> few real use cases we have.
>>>
>>> Having the possibility to run containers without root privileges is a
>>> big deal for many users; look at Flatpak apps for example, or rootless
>>> Podman.
>>> Mounting and validating images would be a big security improvement.
>>> It is something that is not possible at the moment, as fs-verity
>>> doesn't cover the directory structure and dm-verity seems out of reach
>>> from a user namespace.
>>>
>>> Composefs delegates the entire logic of dealing with files to the
>>> underlying file system, in a similar way to overlay.
>>>
>>> Forging the inode metadata from a user namespace mount doesn't look
>>> like an insurmountable problem either, since it is already possible
>>> with a FUSE filesystem.
>>>
>>> So the proposal/wish here is to have a very simple format that at some
>>> point could be considered safe to mount from a user namespace, in
>>> addition to overlay and FUSE.
>>
>> My response is quite similar to
>> https://lore.kernel.org/r/CAJfpeguyajzHwhae=4PWLF4CUBorwFWeybO-xX6UBD2Ekg81fg@xxxxxxxxxxxxxx/
>
> I don't see how that applies to what I said about unprivileged mounts,
> except the part about lazy download, where I agree with Miklos that it
> should be handled through FUSE, and that is something possible with
> composefs:
>
>   mount -t composefs composefs -obasedir=/path/to/store:/mnt/fuse /mnt/cfs
>
> where /mnt/fuse is handled by a FUSE file system that takes care of
> loading the files from the remote server, and possibly writing them to
> /path/to/store once they are complete.
>
> So each user could have their own "lazy download" without interfering
> with other users or the centralized cache.
>
>>>
>>>>> As you know, since I've contacted you, I've looked at EROFS in the
>>>>> past and tried to get our use cases to work with it before thinking
>>>>> about submitting composefs upstream.
>>>>>
>>>>> From what I could see, EROFS and composefs use two different
>>>>> approaches to solve a similar problem, but it is not possible to do
>>>>> exactly with EROFS what we are trying to do.  To oversimplify it: I
>>>>> see EROFS as a block device that uses fscache, and composefs as an
>>>>> overlay for files instead of directories.
>>>>
>>>> I don't think so, honestly.  The EROFS "multiple device" feature is
>>>> actually a "multiple blobs" feature if you really read "device" as
>>>> block device:
>>>>
>>>>   Primary device -- primary blob -- "composefs manifest blob"
>>>>   Blob device    -- data blobs   -- "composefs backing files"
>>>>
>>>> Any difference?
>>>
>>> I wouldn't expect any substantial difference between two RO file
>>> systems.
>>>
>>> Please correct me if I am wrong: EROFS uses 16 bits for the blob
>>> device ID, so if we map each file to a single blob device we are kind
>>> of limited in how many files we can have.
>>
>> I was just using that to represent the "composefs manifest file"
>> concept rather than the device ID.
>>
>>> Sure, this is just an artificial limit and can be bumped in a future
>>> version, but the major difference remains: EROFS uses the blob device
>>> through fscache while the composefs files are looked up in the
>>> specified repositories.
>>
>> No, fscache can also open any cookie when opening a file.  Again, even
>> with fscache, EROFS doesn't need to modify _any_ on-disk format to:
>>
>>  - record a "cookie id" for such a special "magical symlink" with a
>>    similar symlink on-disk format (or whatever on-disk format with
>>    data, just with a new on-disk flag);
>>
>>  - open such a "cookie id" on demand when opening such an EROFS file,
>>    just as any other network fs does.  I don't think the blob device is
>>    a limitation here.
>>
>> Some difference now?
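
[ For illustration only: how the per-user "lazy download" setup quoted
  earlier in this mail (basedir=/path/to/store:/mnt/fuse) might look.
  The store path and its digest-keyed layout are assumptions; fsverity(1)
  is from fsverity-utils. ]

    # a per-user content-addressed store, objects named by fs-verity digest
    $ fsverity measure ~/.local/share/composefs/store/4a/c3...
    sha256:4ac3...  /home/user/.local/share/composefs/store/4a/c3...

    # the last basedir entry is a FUSE mount that fetches missing objects
    # from the remote server and can populate the local store over time
    $ mount -t composefs composefs \
          -o basedir=$HOME/.local/share/composefs/store:/mnt/fuse /mnt/cfs
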
>
> Recording the "cookie id" is done by a singleton userspace daemon that
> controls the cachefiles device, and it requires one operation for each
> file before the image can be mounted.
>
> Is that the case, or did I misunderstand something?
>
>>>
>>>>> Sure, composefs is quite simple and you could embed the composefs
>>>>> features in EROFS and let EROFS behave as composefs when provided a
>>>>> similar manifest file.  But how is that any better than having a
>>>>
>>>> EROFS has had such a feature since v5.16; we call it the primary
>>>> device, or in the Nydus concept --- the "bootstrap file".
>>>>
>>>>> separate implementation that does just one thing well, instead of
>>>>> merging different paradigms together?
>>>>
>>>> It's compatible with an existing fs on-disk format (people can deploy
>>>> the same image to wider scenarios), or you could modify/enhance any
>>>> in-kernel local fs to do so, like I already suggested, such as
>>>> enhancing "fs/romfs" and making it maintained again thanks to this
>>>> magic symlink feature
>>>>
>>>> (because composefs doesn't have on-disk requirements other than a
>>>> symlink path and a SHA256 verity digest from its original
>>>> requirement.  Any local fs can be enhanced like this.)
>>>>
>>>>>
>>>>>> I know that you guys repeatedly say it's a self-contained stackable
>>>>>> fs and has little code (the same words as the Incfs folks [3] said
>>>>>> four years ago already), but four reasons make that weak IMHO:
>>>>>>
>>>>>>  - I think core EROFS is about 2~3 kLOC as well if compression,
>>>>>>    sysfs and fscache are all stripped out.
>>>>>>
>>>>>>    Also, it's always welcome for people to submit cleanup patches.
>>>>>>    I do such cleanups from time to time and they make it better.
>>>>>>
>>>>>>  - "Few lines of code" is a somewhat weak argument because people do
>>>>>>    develop new features and layouts after upstreaming.
>>>>>>
>>>>>>    Such a claim is usually _NOT_ true in the future if you guys do
>>>>>>    more to optimize performance, add new layouts, or even do your
>>>>>>    own lazy pulling with your local CAS codebase, unless you
>>>>>>    *promise* that once the code is dumped, you will do bugfixes
>>>>>>    only, like Christian said [4].
>>>>>>
>>>>>>    From the LWN.net comments, I do see the opposite possibility,
>>>>>>    that you'd like to develop new features later.
>>>>>>
>>>>>>  - In the past, all in-tree kernel filesystems were designed and
>>>>>>    implemented without being tied to one specific userspace,
>>>>>>    including Nydus and ostree (I did see a lot of discussion between
>>>>>>    folks before in the ociv2 brainstorm [5]).
>>>>>
>>>>> Since you are mentioning OCI:
>>>>>
>>>>> Potentially composefs can be the file system that enables something
>>>>> very close to "ociv2", but it won't need to be called v2 since it is
>>>>> completely compatible with the current OCI image format.
>>>>>
>>>>> It won't require a different image format, just a seekable tarball
>>>>> that is compatible with old "v1" clients, and we need to provide the
>>>>> composefs manifest file.
>>>>
>>>> May I ask whether you really looked into what Nydus + EROFS already
>>>> did (as you mentioned we discussed before)?
>>>>
>>>> Your "composefs manifest file" is exactly the "Nydus bootstrap file",
>>>> see:
>>>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
>>>>
>>>> "Rafs is a filesystem image containing a separated metadata blob and
>>>> several data-deduplicated content-addressable data blobs. In a typical
>>>> rafs filesystem, the metadata is stored in bootstrap while the data
>>>> is stored in blobfile.
>>>> ...
>>>>
>>>> bootstrap: The metadata is a merkle tree (I think that is a typo; it
>>>> should be "filesystem tree") whose nodes represents a regular
>>>> filesystem's directory/file a leaf node refers to a file and contains
>>>> hash value of its file data.
>>>> Root node and internal nodes refer to directories and contain the
>>>> hash value of their children nodes."
>>>>
>>>> Nydus already supports the "It won't require a different image
>>>> format, just a seekable tarball that is compatible with old "v1"
>>>> clients and we need to provide the composefs manifest file." feature
>>>> in v2.2, and it will be released later.
>>>
>>> Nydus is not using a tarball compatible with OCI v1.
>>>
>>> It defines a media type "application/vnd.oci.image.layer.nydus.blob.v1",
>>> which means it is not compatible with existing clients that don't know
>>> about it, and you need special handling for that.
>>
>> I am not sure what you're saying: "media type" is quite off topic here.
>>
>> If you say "mkcomposefs" is done on the server side, what is the media
>> type of such manifest files?
>>
>> And why can't Nydus do it in the same way?
>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md
>>
>
> I am not talking about the manifest or the bootstrap file, I am talking
> about the data blobs.
>
>>> Anyway, let's not bother LKML folks with these userspace details.  It
>>> has no relevance to the kernel and what file systems do.
>>
>> I'd like to avoid that too; I didn't say anything about userspace
>> details.  I just would like to say "a merged filesystem tree is also
>> _not_ a new idea of composefs", not "media type", etc.
>>
>>>
>>>>> The seekable tarball allows individual files to be retrieved.  OCI
>>>>> clients will not need to pull the entire tarball, but only the
>>>>> individual files that are not already present in the local CAS.  They
>>>>> also won't need to create the overlay layout at all, as we do today,
>>>>> since it is already described by the composefs manifest file.
>>>>>
>>>>> The manifest is portable across different machines with different
>>>>> configurations, as you can use multiple CASes when mounting composefs.
>>>>>
>>>>> Some users might have a local CAS, some others could have a secondary
>>>>> CAS on a network file system, and composefs supports all these
>>>>> configurations with the same signed manifest file.
>>>>>
>>>>>> That is why EROFS selected the existing in-kernel fscache and made
>>>>>> userspace Nydus adapt to it:
>>>>>>
>>>>>> even the (here called) manifest on-disk format ---
>>>>>> EROFS calls it the primary device ---
>>>>>> they call it the Nydus bootstrap;
>>>>>>
>>>>>> I'm not sure why it becomes impossible for ... ($$$$).
>>>>>
>>>>> I am not sure what you mean, care to elaborate?
>>>>
>>>> I just meant these concepts are actually the same concept with
>>>> different names and:
>>>> Nydus is 2020 stuff;
>>>
>>> CRFS[1] is 2019 stuff.
>>
>> Does CRFS have anything similar to a merged filesystem tree?
>>
>> Here we talked about a local CAS:
>> I have no idea whether CRFS has anything similar to it.
>
> Yes, it does, and it uses it with a FUSE file system.  So neither
> composefs nor EROFS has invented anything here.
>
> Anyway, does it really matter who made what first?  I don't see how it
> helps to understand whether there are relevant differences in composefs
> that justify its presence in the kernel.
>
>>>
>>>> EROFS + primary device is mid-2021 stuff.
>>>>
>>>>>> In addition, if fscache is used, it can also use
>>>>>> fsverity_get_digest() to enable fsverity for non-on-demand files.
>>>>>>
>>>>>> But again, I think even Google's folks think that is (somewhat)
>>>>>> broken, so they added fs-verity to their incFS in a self-contained
>>>>>> way in Feb 2021 [6].
>>>>>>
>>>>>> Finally, again, I do hope for an LSF/MM discussion of this new
>>>>>> overlay model (full of massive magical symlinks to override
>>>>>> permissions).
>>>>>
>>>>> You keep pointing it out, but nobody is overriding any permission.
>>>>> The "symlinks", as you call them, are just a way to refer to the
>>>>> payload files so they can be shared among different mounts.  It is
>>>>> the same idea used by "overlay metacopy" and nobody is complaining
>>>>> about it being a security issue (because it is not).
>>>>
>>>> See, the overlay documentation clearly describes such metacopy
>>>> behavior:
>>>> https://docs.kernel.org/filesystems/overlayfs.html
>>>>
>>>> "
>>>> Do not use metacopy=on with untrusted upper/lower directories.
>>>> Otherwise it is possible that an attacker can create a handcrafted file
>>>> with appropriate REDIRECT and METACOPY xattrs, and gain access to file
>>>> on lower pointed by REDIRECT. This should not be possible on local
>>>> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But
>>>> it should be possible for untrusted layers like from a pen drive.
>>>> "
>>>>
>>>> Do we really need such behavior working on another fs, especially one
>>>> with an on-disk format?  At least Christian said,
>>>> "FUSE and Overlayfs are adventurous enough and they don't have their
>>>> own on-disk format."
>>>
>>> If users want to do something really weird then they can always find a
>>> way, but the composefs lookup is limited to the directories specified
>>> at mount time, so it is not possible to access any file outside the
>>> repository.
>>>
>>>>> The files in the CAS are owned by the user that creates the mount,
>>>>> so there is no need to circumvent any permission check to access
>>>>> them.  We use fs-verity for these files to make sure they are not
>>>>> modified by a malicious user that could get access to them (e.g. a
>>>>> container breakout).
>>>>
>>>> fs-verity is not always enforced, and it's broken here if fsverity is
>>>> not supported by the underlying fses; that is another of my arguable
>>>> points.
>>>
>>> It is a trade-off.  It is up to the user to pick a configuration that
>>> allows using fs-verity if they care about this feature.
>>
>> I don't think fsverity is optional with your plan.
>
> Yes, it is optional.  Without fs-verity it would behave the same as
> today with overlay mounts without any fs-verity.
>
> How does validation work in EROFS for files served from fscache and that
> are on a remote file system?

Never mind my last question, I guess it would still go through the block
device in EROFS.  This is clearly a point in favor of a block device
approach that a stacking file system like overlay or composefs cannot
achieve without support from the underlying file system.

>
>> I wrote all this because it seems I didn't mention the original
>> motivation to use fscache in v2: the kernel already has such an
>> in-kernel local cache, and people liked to use it in 2019 rather than
>> another stackable way (as mentioned in the incremental fs thread).
>
> Still, for us the stackable way works better.
>
>> Thanks,
>> Gao Xiang
>>
>>> Regards,
>>> Giuseppe
>>>
>>> [1] https://github.com/google/crfs
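
[ For illustration only, regarding the fs-verity trade-off discussed
  above: the backing files can only be sealed where the underlying
  filesystem supports fs-verity.  Commands are from e2fsprogs and
  fsverity-utils; the store path is hypothetical. ]

    # fs-verity needs support in the backing fs (e.g. ext4, f2fs, btrfs)
    $ tune2fs -O verity /dev/vdb             # enable the feature on ext4

    # seal an object in the store and read back its measured digest
    $ fsverity enable /path/to/store/4a/c3...
    $ fsverity measure /path/to/store/4a/c3...
    sha256:4ac3...  /path/to/store/4a/c3...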