Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:

> On 2023/1/22 06:34, Giuseppe Scrivano wrote:
>> Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:
>> 
>>> On 2023/1/22 00:19, Giuseppe Scrivano wrote:
>>>> Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:
>>>>
>>>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>>>>>> Hi Amir,
>>>>>> Amir Goldstein <amir73il@xxxxxxxxx> writes:
>>>>>>
>>>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
>>>>>
>>>>> ...
>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Hi Alexander,
>>>>>>>
>>>>>>> I must say that I am a little bit puzzled by this v3.
>>>>>>> Gao, Christian and myself asked you questions on v2
>>>>>>> that are not mentioned in v3 at all.
>>>>>>>
>>>>>>> To sum it up, please do not propose composefs without explaining
>>>>>>> what the barriers are to achieving the exact same outcome with
>>>>>>> the use of a read-only overlayfs with two lower layers -
>>>>>>> the uppermost an erofs containing the metadata files, which include
>>>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>>>>> to the lowermost layer containing the content files.
>>>>>> I think Dave explained quite well why using overlay is not comparable
>>>>>> to what composefs does.
>>>>>> One big difference is that overlay still requires at least a syscall
>>>>>> for each file in the image, and then we need the equivalent of "rm -rf"
>>>>>> to clean it up.  That is somewhat acceptable for long-running services,
>>>>>> but it is not for "serverless" containers where images/containers are
>>>>>> created and destroyed frequently.  So even when we already have all the
>>>>>> image files available locally, we still need to create a checkout with
>>>>>> the final structure we need for the image.
>>>>>> I also don't see how overlay would solve the verified image problem.
>>>>>> We would have the same problem we have today with fs-verity, as it can
>>>>>> only validate a single file but not the entire directory structure.
>>>>>> Changes that affect the layer containing the
>>>>>> trusted.overlay.{metacopy,redirect} xattrs won't be noticed.
>>>>>> There are at the moment two ways to handle container images, both
>>>>>> somewhat guided by the file systems available in the kernel.
>>>>>> - A single image mounted as a block device.
>>>>>> - A list of tarballs (OCI image) that are unpacked and mounted as
>>>>>>   overlay layers.
>>>>>> One big advantage of the block device model is that you can use
>>>>>> dm-verity; this is something we miss today with OCI container images
>>>>>> that use overlay.
>>>>>> What we are proposing with composefs is a way to have "dm-verity"
>>>>>> style validation based on fs-verity and the possibility to share
>>>>>> individual files instead of layers.  These files can also be on
>>>>>> different file systems, which is something not possible with the
>>>>>> block device model.
>>>>>
>>>>> That is not a new idea honestly, including the chain of trust.  Even
>>>>> the later out-of-tree incremental fs used fs-verity for this as well,
>>>>> except that it did so in a truly self-contained way.
>>>>>
>>>>>> The composefs manifest blob could be generated remotely and signed.
>>>>>> A client would just need to validate the signature for the manifest
>>>>>> blob, and from there retrieve the files that are not in the local CAS
>>>>>> (even from an insecure source) and mount the manifest file directly.
>>>>>
>>>>>
>>>>> Back to the topic: after thinking about it some more, I have to add
>>>>> some background for reference.
>>>>>
>>>>> First, EROFS had the same internal discussion and decision at that
>>>>> time, almost _two years ago_ (June 2021), namely:
>>>>>
>>>>>     a) Some internal people really suggested EROFS could develop
>>>>>        an entirely new file-based in-kernel local cache subsystem
>>>>>        (what you call a local CAS, whatever) with a stackable file
>>>>>        interface, so that the existing Nydus image service [1] (like
>>>>>        ostree, and maybe ostree can use it as well) doesn't need to
>>>>>        modify anything to use existing blobs;
>>>>>
>>>>>     b) Reuse the existing fscache/cachefiles;
>>>>>
>>>>> The reason why we (especially me) finally selected b) is that:
>>>>>
>>>>>     - see the discussion of Google's original Incremental FS
>>>>>       topic [2] [3] in 2019, as Amir already mentioned.  At
>>>>>       that time all fs folks really preferred to reuse an existing
>>>>>       subsystem for in-kernel caching rather than reinvent yet
>>>>>       another in-kernel wheel for a local cache.
>>>>>
>>>>>       [ Reinventing a new wheel is not hard (fs or caching), it just
>>>>>         makes Linux more fragmented.  Especially when a new filesystem
>>>>>         is proposed just to generate images full of massive new
>>>>>         magical symlinks with *overridden* uid/gid/permissions
>>>>>         to replace regular files. ]
>>>>>
>>>>>     - in-kernel cache implementations usually hit several common
>>>>>       potential security issues; reusing an existing subsystem lets
>>>>>       all fses address them and benefit from the fixes.
>>>>>
>>>>>     - Usually an existing widely-used userspace implementation is
>>>>>       never by itself a justification for a new in-kernel feature.
>>>>>
>>>>> David Howells has been quite busy these months developing the new
>>>>> netfs interface; otherwise (we think) we would already support
>>>>> failover, multiple daemons/dirs, daemonless mode and more.
>>>> We have not added any new cache system.  Overlay does "layer
>>>> deduplication" and in a similar way composefs does "file deduplication".
>>>> That is not a built-in feature, it is just a side effect of how things
>>>> are packed together.
>>>> Using fscache seems like a good idea and it has many advantages, but it
>>>> is a centralized cache mechanism and it looks like a potential problem
>>>> when you think about allowing mounts from a user namespace.
>>>
>>> I think Christian [1] had the same feeling as me at that time:
>>>
>>> "I'm pretty skeptical of this plan whether we should add more filesystems
>>>   that are mountable by unprivileged users. FUSE and Overlayfs are
>>>   adventurous enough and they don't have their own on-disk format. The
>>>   track record of bugs exploitable due to userns isn't making this
>>>   very attractive."
>>>
>>> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use
>>> dm-verity) as well; it doesn't change _anything_ about the concerns over
>>> "allowing mounts from a user namespace".
>> I've mentioned that as a potential feature we could add in the future,
>> given the simplicity of the format and the fact that it uses a CAS for
>> its data instead of fscache.  Each user can have and use their own store
>> to mount the images.
>> At this point it is just a wish from userspace, as it would improve a
>> few real use cases we have.
>> Having the possibility to run containers without root privileges is a
>> big deal for many users; look at Flatpak apps for example, or rootless
>> Podman.  Mounting and validating images would be a big security
>> improvement.  It is something that is not possible at the moment, as
>> fs-verity doesn't cover the directory structure and dm-verity seems out
>> of reach from a user namespace.
>> Composefs delegates the entire logic of dealing with files to the
>> underlying file system, in a similar way to overlay.
>> Forging the inode metadata from a user namespace mount doesn't look
>> like an insurmountable problem either, since it is already possible
>> with a FUSE filesystem.
>> So the proposal/wish here is to have a very simple format that at some
>> point could be considered safe to mount from a user namespace, in
>> addition to overlay and FUSE.
>
> My response is quite similar to
> https://lore.kernel.org/r/CAJfpeguyajzHwhae=4PWLF4CUBorwFWeybO-xX6UBD2Ekg81fg@xxxxxxxxxxxxxx/

I don't see how that applies to what I said about unprivileged mounts,
except for the part about lazy download, where I agree with Miklos that
it should be handled through FUSE; and that is possible with composefs:

mount -t composefs composefs -obasedir=/path/to/store:/mnt/fuse /mnt/cfs

where /mnt/fuse is handled by a FUSE file system that takes care of
loading the files from the remote server, and possibly writes them to
/path/to/store once they are complete.

So each user could have their own "lazy download" without interfering with
other users or the centralized cache.
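
To make the idea concrete, here is a rough sketch of what such a per-user
setup could look like (the "lazyfs" FUSE daemon name and the paths are made
up for illustration; only the composefs mount syntax is taken from above):

  # unprivileged user mounts a FUSE fs that fetches missing objects on demand
  lazyfs --remote https://registry.example.com/objects /mnt/fuse

  # composefs looks up backing files first in the local store, then in the
  # FUSE mount; objects fetched through FUSE can later be copied to the store
  mount -t composefs composefs \
        -obasedir=/path/to/store:/mnt/fuse /mnt/cfs

  # cleanup is just an unmount, no per-file work is needed
  umount /mnt/cfs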

>> 
>>>> As you know, since I've contacted you before, I've looked at EROFS in
>>>> the past and tried to get our use cases to work with it before thinking
>>>> about submitting composefs upstream.
>>>> From what I could see, EROFS and composefs use two different approaches
>>>> to solve a similar problem, but it is not possible to do exactly with
>>>> EROFS what we are trying to do.  To oversimplify it: I see EROFS as a
>>>> block device that uses fscache, and composefs as an overlay for files
>>>> instead of directories.
>>>
>>> I don't think so, honestly.  EROFS's "multiple device" feature is
>>> actually a "multiple blobs" feature, if you really think of "device"
>>> as a block device.
>>>
>>> Primary device -- primary blob -- "composefs manifest blob"
>>> Blob device -- data blobs -- "composefs backing files"
>>>
>>> any difference?
>> I wouldn't expect any substantial difference between two RO file systems.
>> Please correct me if I am wrong: EROFS uses 16 bits for the blob device
>> ID, so if we map each file to a single blob device we are rather limited
>> in how many files we can have.
>
> I brought this up just to illustrate the "composefs manifest file" concept
> rather than the device ID.
>
>> Sure, this is just an artificial limit and can be bumped in a future
>> version, but the major difference remains: EROFS uses the blob device
>> through fscache while the composefs files are looked up in the specified
>> repositories.
>
> No, fscache can also open any cookie when opening a file.  Again, even with
> fscache, EROFS doesn't need to modify _any_ on-disk format to:
>
>   - record a "cookie id" for such a special "magical symlink" with a similar
>     symlink on-disk format (or whatever on-disk format with data, just with
>     a new on-disk flag);
>
>   - open such a "cookie id" on demand when opening such an EROFS file, just
>     as any other network fs does.  I don't think the blob device is a
>     limitation here.
>
> Some difference now?

recording the "cookie id" is done by a singleton userspace daemon that 
controls the cachefiles device and requires one operation for each file
before the image can be mounted.

Is that the case or I misunderstood something?
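
For comparison, the store composefs reads from is just a plain directory of
backing files addressed by the paths recorded in the manifest; the layout
below is only an illustrative example (the digest-style object paths are an
assumption, not something mandated by the format):

  /path/to/store/
    ab/
      cdef0123...89   <- regular file; payload for e.g. /usr/bin/bash
    ff/
      0011aabb...42   <- payload shared by every image that contains it

  # populating the store is a plain copy/link, cleanup is a plain rm -rf,
  # and no privileged daemon or per-file registration step is involved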

>> 
>>>> Sure, composefs is quite simple and you could embed the composefs
>>>> features in EROFS and let EROFS behave as composefs when provided a
>>>> similar manifest file.  But how is that any better than having a
>>>
>>> EROFS has had such a feature since v5.16; we call it the primary device,
>>> or in Nydus terms the "bootstrap file".
>>>
>>>> separate implementation that does just one thing well instead of merging
>>>> different paradigms together?
>>>
>>> It's compatible with an existing fs on-disk format (people can deploy the
>>> same image to wider scenarios), or you could modify/enhance any in-kernel
>>> local fs to do so like I already suggested, such as enhancing "fs/romfs"
>>> and making it maintained again thanks to this magic symlink feature
>>>
>>> (because composefs doesn't have on-disk requirements other than
>>>  a symlink path and a SHA256 verity digest in its original
>>>  requirement.  Any local fs can be enhanced like this.)
>>>
>>>>
>>>>> I know that you guys repeatedly say it's a self-contained
>>>>> stackable fs with little code (the same words the Incfs
>>>>> folks [3] said four years ago already); four reasons make
>>>>> that argument weak IMHO:
>>>>>
>>>>>     - I think core EROFS is about 2~3 kLOC as well if the
>>>>>       compression, sysfs and fscache code is all stripped out.
>>>>>
>>>>>       Also, everyone is always welcome to submit cleanup
>>>>>       patches.  I do such cleanups from time to time and they
>>>>>       make it better.
>>>>>
>>>>>     - "Few code lines" is somewhat weak because people do
>>>>>       develop new features, layout after upstream.
>>>>>
>>>>>       Such claim is usually _NOT_ true in the future if you
>>>>>       guys do more to optimize performance, new layout or even
>>>>>       do your own lazy pulling with your local CAS codebase in
>>>>>       the future unless
>>>>>       you *promise* you once dump the code, and do bugfix
>>>>>       only like Christian said [4].
>>>>>
>>>>>       From LWN.net comments, I do see the opposite
>>>>>       possibility that you'd like to develop new features
>>>>>       later.
>>>>>
>>>>>     - In the past, all in-tree kernel filesystems were
>>>>>       designed and implemented without being tied to one
>>>>>       specific userspace, Nydus and ostree included (I did
>>>>>       see a lot of discussion between folks before in the
>>>>>       ociv2 brainstorm [5]).
>>>> Since you are mentioning OCI:
>>>> Potentially composefs can be the file system that enables something very
>>>> close to "ociv2", but it won't need to be called v2 since it is
>>>> completely compatible with the current OCI image format.
>>>> It won't require a different image format, just a seekable tarball that
>>>> is compatible with old "v1" clients, plus the composefs manifest file
>>>> that we need to provide.
>>>
>>> May I ask whether you really looked into what Nydus + EROFS already do
>>> (as you mentioned we discussed before)?
>>>
>>> Your "composefs manifest file" is exactly "Nydus bootstrap file", see:
>>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
>>>
>>> "Rafs is a filesystem image containing a separated metadata blob and
>>>   several data-deduplicated content-addressable data blobs. In a typical
>>>   rafs filesystem, the metadata is stored in bootstrap while the data
>>>   is stored in blobfile.
>>>   ...
>>>
>>>   bootstrap:  The metadata is a merkle tree (I think that is a typo, it
>>>   should be filesystem tree) whose nodes represents a regular filesystem's
>>>   directory/file a leaf node refers to a file and contains hash value of
>>>   its file data.
>>>     Root node and internal nodes refer to directories and contain the
>>>    hash value
>>>   of their children nodes."
>>>
>>> Nydus already supports the "It won't require a different image format,
>>> just a seekable tarball that is compatible with old "v1" clients and we
>>> need to provide the composefs manifest file" feature in v2.2, which will
>>> be released later.
>> Nydus is not using a tarball compatible with OCI v1.
>> It defines a media type "application/vnd.oci.image.layer.nydus.blob.v1",
>> which means it is not compatible with existing clients that don't know
>> about it, and you need special handling for that.
>
> I am not sure what you're saying: "media type" is quite off topic here.
>
> If you say "mkcomposefs" is run on the server side, what is the media
> type of such manifest files?
>
> And why can't Nydus do it in the same way?
> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md
>

I am not talking about the manifest or the bootstrap file, I am talking
about the data blobs.

>> Anyway, let's not bother LKML folks with these userspace details.  It
>> has no relevance to the kernel and what file systems do.
>
> I'd like to avoid that too.  I didn't say anything about userspace details;
> I just would like to say that "a merged filesystem tree is also _not_ a new
> idea of composefs", not talk about "media type", etc.
>
>> 
>>>> The seekable tarball allows individual files to be retrieved.  OCI
>>>> clients will not need to pull the entire tarball, but only the individual
>>>> files that are not already present in the local CAS.  They also won't
>>>> need to create the overlay layout at all, as we do today, since it is
>>>> already described by the composefs manifest file.
>>>> The manifest is portable across different machines with different
>>>> configurations, as you can use more than one CAS when mounting composefs.
>>>> Some users might have a local CAS, some others could have a secondary
>>>> CAS on a network file system, and composefs supports all these
>>>> configurations with the same signed manifest file.
>>>>
>>>>>       That is why EROFS selected the existing in-kernel fscache and
>>>>>       made userspace Nydus adapt to it:
>>>>>
>>>>>         even the (here so-called) manifest on-disk format ---
>>>>>              EROFS calls it the primary device ---
>>>>>              Nydus calls it the bootstrap;
>>>>>
>>>>>       I'm not sure why it becomes impossible for ... ($$$$).
>>>> I am not sure what you mean, care to elaborate?
>>>
>>> I just meant these concepts are actually the same concept with
>>> different names and:
>>>    Nydus is 2020 stuff;
>> CRFS [1] is 2019 stuff.
>
> Does CRFS have anything similar to a merged filesystem tree?
>
> Here we talked about a local CAS:
> I have no idea whether CRFS has anything similar to it.

Yes, it does, and it uses it with a FUSE file system.  So neither
composefs nor EROFS has invented anything here.

Anyway, does it really matter who made what first?  I don't see how that
helps in understanding whether there are differences in composefs relevant
enough to justify its presence in the kernel.

>> 
>>>    EROFS + primary device is mid-2021 stuff.
>>>
>>>>> In addition, if fscache is used, it can also use
>>>>> fsverity_get_digest() to enable fsverity for non-on-demand
>>>>> files.
>>>>>
>>>>> But again, I think even Google's folks consider that
>>>>> (somewhat) broken, which is why they added fs-verity to their
>>>>> incFS in a self-contained way in Feb 2021 [6].
>>>>>
>>>>> Finally, again, I do hope for an LSF/MM discussion of this new
>>>>> overlay model (full of massive magical symlinks to override
>>>>> permissions).
>>>> You keep pointing it out, but nobody is overriding any permission.  The
>>>> "symlinks", as you call them, are just a way to refer to the payload
>>>> files so they can be shared among different mounts.  It is the same idea
>>>> used by "overlay metacopy" and nobody is complaining about it being a
>>>> security issue (because it is not).
>>>
>>> See, the overlay documentation clearly describes such metacopy behavior:
>>> https://docs.kernel.org/filesystems/overlayfs.html
>>>
>>> "
>>> Do not use metacopy=on with untrusted upper/lower directories.
>>> Otherwise it is possible that an attacker can create a handcrafted file
>>> with appropriate REDIRECT and METACOPY xattrs, and gain access to file
>>> on lower pointed by REDIRECT. This should not be possible on local
>>> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But
>>> it should be possible for untrusted layers like from a pen drive.
>>> "
>>>
>>> Do we really need such behavior in another fs, especially one with an
>>> on-disk format?  At least Christian said,
>>> "FUSE and Overlayfs are adventurous enough and they don't have their
>>> own on-disk format."
>> If users want to do something really weird then they can always find a
>> way, but the composefs lookup is limited to the directories specified
>> at mount time, so it is not possible to access any file outside the
>> repository.
>> 
>>>> The files in the CAS are owned by the user that creates the mount, so
>>>> there is no need to circumvent any permission check to access them.
>>>> We use fs-verity for these files to make sure they are not modified by a
>>>> malicious user that could get access to them (e.g. a container breakout).
>>>
>>> fs-verity is not always enforced, and it's broken here if fsverity is
>>> not supported by the underlying fses; that is another of my arguable
>>> points.
>> It is a trade-off.  It is up to the user to pick a configuration that
>> allows using fs-verity if they care about this feature.
>
> I don't think fsverity is optional with your plan.

Yes, it is optional.  Without fs-verity it would behave the same as today
with overlay mounts that don't use any fs-verity.
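
For reference, a rough sketch of what the optional check amounts to on the
store side, using fsverity-utils (the store path is made up; whether and how
composefs enforces the recorded digest depends on the mount configuration):

  # enable fs-verity on a backing file when it is added to the store
  # (requires an underlying fs with fs-verity support, e.g. ext4 or btrfs)
  fsverity enable /path/to/store/ab/cdef0123...

  # the measured digest can then be compared with the digest recorded in
  # the (signed) composefs manifest before the file content is trusted
  fsverity measure /path/to/store/ab/cdef0123...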

How does validation work in EROFS for files that are served from fscache
and that live on a remote file system?

> I wrote all this because it seems I didn't mention the original motivation
> to use fscache in v2: the kernel already has such an in-kernel local cache,
> and in 2019 people preferred to use it rather than another stackable
> approach (as mentioned in the incremental fs thread).

Still, for us the stackable approach works better.

> Thanks,
> Gao Xiang
>
>> Regards,
>> Giuseppe
>> [1] https://github.com/google/crfs




