Re: overlayfs: NFS lowerdir changes & opaque negative lookups

Mike Baynton <mike@xxxxxxxxxxxx> · Sun, 21 Jul 2024 22:02:46 -0500

On 7/12/24 04:09, Amir Goldstein wrote:
> On Fri, Jul 12, 2024 at 6:24 AM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
>>
>> On 7/11/24 18:30, Amir Goldstein wrote:
>>> On Thu, Jul 11, 2024 at 6:59 PM Daire Byrne <daire@xxxxxxxx> wrote:
>>>> Basically I have a read-only NFS filesystem with software releases
>>>> that are versioned such that no files are ever overwritten or changed.
>>>> New uniquely named directory trees and files are added from time to
>>>> time and older ones are cleaned up.
>>>>
>>>
>>> Sounds like a common use case that many people are interested in.
>>
>> I can vouch that that's accurate, I'm doing nearly the same thing. The
>> properties of the NFS filesystem in terms of what is and is not expected
>> to change is identical for me, though my approach to incorporating
>> overlayfs has been a little different.
>>
>> My confidence in the reliability of what I'm doing is still far from
>> absolute, so I will be interested in efforts to validate/officially
>> sanction/support/document related techniques.
>>
>> The way I am doing it is with NFS as a data-only layer. Basically my use
>> case calls for presenting different views of NFS-backed data (it's
>> software libraries) to different applications. No application wants or
>> needs to have the entire NFS tree exposed to it, but each application
>> wants to use some data available on NFS and wants it to be presented in
>> some particular local place. So I actually wanted a method where I
>> author a metadata-only layer external to overlayfs, built to spec.
>>
>> Essentially it's making overlayfs redirects be my symlinks so that code
>> which doesn't follow symlinks or is otherwise influenced by them is none
>> the wiser.
>>
> 
> Nice.
> I've always wished that data-only would not be an "offline-only" feature,
> but getting the official API for that scheme right might be a challenge.
> 
>>>> My first question is how bad can the "undefined behaviour" be in this
>>>> kind of setup?
>>>
>>> The behavior is "undefined" because nobody tried to define it,
>>> document it and test it.
>>> I don't think it would be that "bad", but it will be unpredictable
>>> and is not very nice for a software product.
>>>
>>> One of the current problems is that overlayfs uses a readdir cache, and
>>> the readdir cache is not auto-invalidated when the lower dir changes,
>>> so whether or not new subdirs are observed in overlay depends
>>> on whether the merged overlay directory is kept in cache or not.
>>>
>>
>> My approach doesn't support adding new files from the data-only NFS
>> layer after the overlayfs is created, of course, since the metadata-only
>> layer is itself the first lower layer and so would presumably get into
>> undefined-land if added to. But this arrangement does probably
>> mitigate this problem. Creating metadata inodes of a fixed set of
>> libraries for a specific application is cheap enough (and considerably
>> faster than copying it all locally) that the immutability limitation
>> works for me.
>>
> 
> Assuming that this "effectively-data-only" NFS layer is never iterated via
> overlayfs then adding new unreferenced objects to this layer should not
> be a problem either.
> 
>>>> Any files that get copied up to the upper layer are
>>>> guaranteed to never change in the lower NFS filesystem (by its
>>>> design), but new directories and files that have not yet been copied
>>>> up, can randomly appear over time. Deletions are not so important
>>>> because if it has been deleted in the lower level, then the upper
>>>> level copy failing has similar results (but we should clean up the
>>>> upper layer too).
>>>>
>>>> If it's possible to get over this first difficult hurdle, then I have
>>>> another extra bit of complexity to throw on top - now manually make an
>>>> entire directory tree (of metadata) that we have recursively copied up
>>>> "opaque" in the upper layer (currently needs to be done outside of
>>>> overlayfs). Over time or dropping of caches, I have found that this
>>>> (seamlessly?) takes effect for new lookups.
>>>>
>>>> I also noticed that in the current implementation, this "opaque"
>>>> transition actually breaks access to the file because the metadata
>>>> copy-up sets "trusted.overlay.metacopy" but does not currently add an
>>>> explicit "trusted.overlay.redirect" to the corresponding lower layer
>>>> file. But if it did (or we do it manually with setfattr), then it is
>>>> possible to have an upper level directory that is opaque, contains
>>>> file metadata only and redirects the data to the real files on the
>>>> lower NFS filesystem.
>>
>> So once you use opaque dirs and redirects on an upper layer, it's
>> sounding very similar to redirects into a data-only layer. In either
>> case you're responsible for producing metadata inodes for each NFS file
>> you want presented to the application/user.
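
As a rough illustration of "producing metadata inodes", here is a
minimal Python sketch that only generates the setfattr command lines
for the xattrs discussed above. The function names and all paths
(/upper/releases/v2, /blobs/v2/libfoo.so) are hypothetical, and it
assumes the redirect value is an absolute path from the overlay root.
It does not create the inodes themselves (size, owner, mode), and the
commands are meant to be reviewed and run as root outside of overlayfs.

```python
import shlex


def metacopy_xattr_commands(upper_file, redirect_target):
    """Return setfattr argv lists that mark upper_file as metadata-only
    and point its data at redirect_target (hypothetical helper)."""
    return [
        # Empty-valued xattr: the file carries metadata only.
        ["setfattr", "-n", "trusted.overlay.metacopy", upper_file],
        # Where overlayfs should look for the data in the lower layers.
        ["setfattr", "-n", "trusted.overlay.redirect",
         "-v", redirect_target, upper_file],
    ]


def opaque_xattr_command(upper_dir):
    """Return the setfattr argv that marks upper_dir as opaque."""
    return ["setfattr", "-n", "trusted.overlay.opaque", "-v", "y", upper_dir]


# Hypothetical example paths, for illustration only.
commands = [opaque_xattr_command("/upper/releases/v2")]
commands += metacopy_xattr_commands(
    "/upper/releases/v2/libfoo.so", "/blobs/v2/libfoo.so"
)
script = "\n".join(shlex.join(argv) for argv in commands)
print(script)
```

Emitting a script rather than calling setxattr directly keeps the
sketch runnable without privileges and leaves a reviewable record of
what would be written to the upper layer.
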
>>
> 
> Yes, it is almost the same as data-only layer.
> The only difference is that real data-only layer can never be accessed
> directly from overlay, while the effectively-data-only layer must have
> some path (e.g /blobs) accessible directly from overlay in order to do
> online rename of blobs into the upper opaque layer.
> 
>> This way seems interesting and more promising for adding NFS-backed
>> files "online" though.
>>
>>> how can we document it to make the behavior "defined"?
>>>
>>> My thinking is:
>>>
>>> "Changes to the underlying filesystems while part of a mounted overlay
>>> filesystem are not allowed.  If the underlying filesystem is changed,
>>> the behavior of the overlay is undefined, though it will not result in
>>> a crash or deadlock.
>>>
>>> One exception to this rule is changes to underlying filesystem objects
>>> that were not accessed by a overlayfs prior to the change.
>>> In other words, once accessed from a mounted overlay filesystem,
>>> changes to the underlying filesystem objects are not allowed."
>>>
>>> But this claim needs to be proved and tested (write tests),
>>> before the documentation defines this behavior.
>>> I am not even sure if the claim is correct.
>>
>> I've been blissfully and naively assuming that it is based on intuition
>> :).
> 
> Yes, what overlay did not observe, overlay cannot know about.
> But the devil is in the details, such as what is an "accessed
> filesystem object".
> 
> In our case study, we refer to the newly added directory entries
> and new inodes "never accessed by overlayfs", so it sounds
> safe to add them while overlayfs is mounted, but their parent directory,
> even if never iterated via overlayfs was indeed accessed by overlayfs
> (when looking up for existing siblings), so overlayfs did access
> the lower parent directory and it does reference the lower parent
> directory dentry/inode, so it is still not "intuitively" safe to change it.
> 
>>
>> I think Daire and I are basically only adding new files to the NFS
>> filesystem, and both the all-opaque approach and the data-only approach
>> could prevent accidental access to things on the NFS filesystem through
>> the overlayfs (or at least the portion of it meant for end-user consumption)
>> while they are still being birthed and might be experiencing changes.
>> At some point in the NFS tree, directories must be modified, but since
>> both approaches have overlayfs sourcing all directory entries from local
>> metadata-only layers, it seems plausible that the directories that
>> change aren't really "accessed by a overlayfs prior to the change."
>>
>> How much proving/testing would you want to see before documenting this
>> and supporting someone in future who finds a way to prove the claim
>> wrong?
>>
> 
> *very* good question :)
> 
> For testing, an xfstest will do - you can fork one of the existing
> data-only tests as a template.

Due to the extended delay in a substantive response, I just wanted to
send a quick thank you for your reply and suggestions here. I am still
interested in pursuing this, but I have been busy and then recovering
from illness.

I think I'll need to study how xfstest directly exercises overlayfs and
how it is combined with unionmount-testsuite.

> 
> For documentation, I think it is too hard to commit to the general
> statement above.
> 
> Try to narrow the exception to the rule to the very specific use case
> of "append-only" instead of "immutable" lower directory and then
> state that the behavior is "defined" - the new entries are either visible
> by overlayfs or they are not visible, and the "undefined" element
> is *when* they become visible and via which API (*).
> 
> (*) New entries may be visible to lookup and invisible to readdir
>      due to overlayfs readdir cache, and entries could be visible to
>      readdir and invisible to lookup, due to vfs negative lookup cache.
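
The footnote can be pictured with a toy Python model. This is not
kernel code and not how overlayfs is implemented; ToyOverlayDir and
its two caches are hypothetical stand-ins for the overlayfs readdir
cache and the vfs negative lookup cache, and the names v1..v3 are
made up. It only shows how the two APIs can disagree for an
append-only lower directory.

```python
class ToyOverlayDir:
    """Toy model of one merged directory (illustrative only)."""

    def __init__(self, lower_entries):
        self.lower = lower_entries    # live view of the lower (NFS) dir
        self.readdir_cache = None     # stand-in for the readdir cache
        self.negative_cache = set()   # stand-in for negative dentries

    def readdir(self):
        # The listing is computed once and reused until the cache is dropped.
        if self.readdir_cache is None:
            self.readdir_cache = sorted(self.lower)
        return list(self.readdir_cache)

    def lookup(self, name):
        # A cached miss is returned without consulting the lower dir again.
        if name in self.negative_cache:
            return False
        if name in self.lower:
            return True
        self.negative_cache.add(name)
        return False


lower = {"v1"}
ovl = ToyOverlayDir(lower)
ovl.readdir()                      # fills the readdir cache
ovl.lookup("v3")                   # caches a negative result for "v3"

lower.update({"v2", "v3"})         # new releases appear on the lower layer

v2_lookup = ovl.lookup("v2")       # found by lookup ...
v2_listed = "v2" in ovl.readdir()  # ... but missing from the stale listing

ovl.readdir_cache = None           # only the readdir cache is dropped
v3_listed = "v3" in ovl.readdir()  # now listed ...
v3_lookup = ovl.lookup("v3")       # ... but the cached miss still hides it
```

In this model every entry is either visible or not, and only the
timing and the API through which it shows up vary, which is the
"defined" part of the proposed wording.
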
> 
> Note that the behavior of POSIX readdir() for entries added while
> an open dir fd is being iterated is similar - the new entries will either
> be visible in the iteration of that fd or they won't be, but there is
> a clear "barrier" when the new entries will become visible
> (on seek to start or open of new fd).
>
>>>
>>> One more thing that could help said service is if overlayfs
>>> supported a hybrid mode of redirect_dir=follow,metacopy=on,
>>> where redirect is enabled for regular files for metacopy, but NOT
>>> enabled for directories (which was redirect_dir original use case).
>>>
>>> This way, the service could run the command line:
>>> $ mv /ovl/blah/thing /ovl/local
>>> then "mv" will get EXDEV for moving directories and will create
>>> opaque directories in their place and it will recursively move all
>>> the files to the opaque directories.
>>
>> Clever.
> 
> Feel free to post this patch if you find it useful.
> The commit message should say that the mount option
> check does not reflect the actual dependency in the code,
> and it should also explain very well why this mount option combination
> is desired and lore Link: to this conversation.
> 
> Thanks,
> Amir.