On Sun, Jul 28, 2024 at 11:33 PM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
>
> On 7/21/24 22:31, Amir Goldstein wrote:
> >
> > On Mon, Jul 22, 2024, 6:02 AM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
> >
> > On 7/12/24 04:09, Amir Goldstein wrote:
> >> On Fri, Jul 12, 2024 at 6:24 AM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
> >>>
> >>> On 7/11/24 18:30, Amir Goldstein wrote:
> >>>> On Thu, Jul 11, 2024 at 6:59 PM Daire Byrne <daire@xxxxxxxx> wrote:
> >>>>> Basically I have a read-only NFS filesystem with software releases that are versioned such that no files are ever overwritten or changed. New uniquely named directory trees and files are added from time to time and older ones are cleaned up.
> >>>>
> >>>> Sounds like a common use case that many people are interested in.
> >>>
> >>> I can vouch that that's accurate, I'm doing nearly the same thing. The properties of the NFS filesystem, in terms of what is and is not expected to change, are identical for me, though my approach to incorporating overlayfs has been a little different.
> >>>
> >>> My confidence in the reliability of what I'm doing is still far from absolute, so I will be interested in efforts to validate/officially sanction/support/document related techniques.
> >>>
> >>> The way I am doing it is with NFS as a data-only layer. Basically my use case calls for presenting different views of NFS-backed data (it's software libraries) to different applications. No application wants or needs to have the entire NFS tree exposed to it, but each application wants to use some data available on NFS and wants it to be presented in some particular local place. So I actually wanted a method where I author a metadata-only layer external to overlayfs, built to spec.
> >>>
> >>> Essentially it's making overlayfs redirects be my symlinks, so that code which doesn't follow symlinks or is otherwise influenced by them is none the wiser.
> >>
> >> Nice. I've always wished that data-only would not be an "offline-only" feature, but getting the official API for that scheme right might be a challenge.
> >>
> >>>>> My first question is how bad can the "undefined behaviour" be in this kind of setup?
> >>>>
> >>>> The behavior is "undefined" because nobody tried to define it, document it and test it. I don't think it would be that "bad", but it will be unpredictable and is not very nice for a software product.
> >>>>
> >>>> One of the current problems is that overlayfs uses a readdir cache; the readdir cache is not auto-invalidated when a lower dir changes, so whether or not new subdirs are observed in the overlay depends on whether the merged overlay directory is kept in cache or not.
> >>>
> >>> My approach doesn't support adding new files from the data-only NFS layer after the overlayfs is created, of course, since the metadata-only layer is itself the first lower layer and so would presumably get into undefined-land if added to. But this arrangement does probably mitigate this problem. Creating metadata inodes of a fixed set of libraries for a specific application is cheap enough (and considerably faster than copying it all locally) that the immutability limitation works for me.
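Spelling the scheme out for the archive: below is a rough, untested sketch of the kind of layout you are describing. All paths, names and sizes are made up, and the exact option spellings depend on kernel version; the point is only that the locally authored layer carries metadata stubs whose redirect xattrs point at the real files in the data-only NFS layer.

    nfs=/mnt/nfs/releases        # big read-only NFS tree, used as data-only layer
    meta=/srv/app1/meta          # small local metadata-only layer, built to spec

    mkdir -p $meta/lib
    # stub inode: same size/mode as the real file, but no data
    truncate -s 123456 $meta/lib/libfoo.so.1
    chmod 755 $meta/lib/libfoo.so.1
    # mark the stub as metadata-only (zero-length value, IIRC) and
    # redirect it to the real file's path inside the data-only layer
    setfattr -n trusted.overlay.metacopy $meta/lib/libfoo.so.1
    setfattr -n trusted.overlay.redirect -v "/v1.2.3/lib/libfoo.so.1" \
        $meta/lib/libfoo.so.1

    # '::' separates data-only layers; data-only layers require metacopy
    mount -t overlay overlay \
        -o metacopy=on,redirect_dir=on,lowerdir=$meta::$nfs /srv/app1/view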
> >> Assuming that this "effectively-data-only" NFS layer is never iterated via overlayfs, then adding new unreferenced objects to this layer should not be a problem either.
> >>
> >>>>> Any files that get copied up to the upper layer are guaranteed to never change in the lower NFS filesystem (by its design), but new directories and files that have not yet been copied up can randomly appear over time. Deletions are not so important because if it has been deleted in the lower level, then the upper level copy failing has similar results (but we should clean up the upper layer too).
> >>>>>
> >>>>> If it's possible to get over this first difficult hurdle, then I have another extra bit of complexity to throw on top - now manually make an entire directory tree (of metadata) that we have recursively copied up "opaque" in the upper layer (currently needs to be done outside of overlayfs). Over time or with dropping of caches, I have found that this (seamlessly?) takes effect for new lookups.
> >>>>>
> >>>>> I also noticed that in the current implementation, this "opaque" transition actually breaks access to the file because the metadata copy-up sets "trusted.overlay.metacopy" but does not currently add an explicit "trusted.overlay.redirect" to the corresponding lower layer file. But if it did (or we do it manually with setfattr), then it is possible to have an upper level directory that is opaque, contains file metadata only and redirects for the data to the real files on the lower NFS filesystem.
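For the archive, the manual steps Daire describes would look roughly like this - an illustrative, untested sketch with made-up paths, done offline (overlay unmounted) and needing root for the trusted.* namespace:

    upper=/local/upper    # upper layer of the overlay in question

    # the tree was already metadata-only copied up through the overlay
    # (metacopy=on), so its files carry trusted.overlay.metacopy

    # make the whole copied-up tree opaque, so lookup and readdir stop
    # falling through to the (changing) lower NFS directories
    find $upper/releases/v1.2.3 -type d \
        -exec setfattr -n trusted.overlay.opaque -v y {} +

    # add the redirect that metadata copy-up does not add by itself today,
    # so the now-opaque metacopy files can still reach their lower data
    # (one redirect per regular file; a single file shown here)
    setfattr -n trusted.overlay.redirect -v "/releases/v1.2.3/libfoo.so.1" \
        $upper/releases/v1.2.3/libfoo.so.1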
> >>> So once you use opaque dirs and redirects on an upper layer, it's sounding very similar to redirects into a data-only layer. In either case you're responsible for producing metadata inodes for each NFS file you want presented to the application/user.
> >>
> >> Yes, it is almost the same as a data-only layer. The only difference is that a real data-only layer can never be accessed directly from the overlay, while the effectively-data-only layer must have some path (e.g. /blobs) accessible directly from the overlay in order to do online rename of blobs into the upper opaque layer.
> >>
> >>> This way seems interesting and more promising for adding NFS-backed files "online" though.
> >>>
> >>>> how can we document it to make the behavior "defined"?
> >>>>
> >>>> My thinking is:
> >>>>
> >>>> "Changes to the underlying filesystems while part of a mounted overlay filesystem are not allowed. If the underlying filesystem is changed, the behavior of the overlay is undefined, though it will not result in a crash or deadlock.
> >>>>
> >>>> One exception to this rule is changes to underlying filesystem objects that were not accessed by an overlayfs prior to the change. In other words, once accessed from a mounted overlay filesystem, changes to the underlying filesystem objects are not allowed."
> >>>>
> >>>> But this claim needs to be proved and tested (write tests) before the documentation defines this behavior. I am not even sure if the claim is correct.
> >>>
> >>> I've been blissfully and naively assuming that it is based on intuition :).
> >>
> >> Yes, what overlay did not observe, overlay cannot know about. But the devil is in the details, such as what is an "accessed filesystem object".
> >>
> >> In our case study, we refer to the newly added directory entries and new inodes as "never accessed by overlayfs", so it sounds safe to add them while overlayfs is mounted. But their parent directory, even if never iterated via overlayfs, was indeed accessed by overlayfs (when looking up existing siblings), so overlayfs did access the lower parent directory and it does reference the lower parent directory dentry/inode, so it is still not "intuitively" safe to change it.
>
> This makes sense. I've been sure to cause the directory in the data-only layer that subsequently experiences an "append" to be consulted to look up a different file before the append.
>
> >>> I think Daire and I are basically only adding new files to the NFS filesystem, and both the all-opaque approach and the data-only approach could prevent accidental access to things on the NFS filesystem through the overlayfs (or at least the portion of it meant for end-user consumption) while they are still being birthed and might be experiencing changes. At some point in the NFS tree, directories must be modified, but since both approaches have overlayfs sourcing all directory entries from local metadata-only layers, it seems plausible that the directories that change aren't really "accessed by an overlayfs prior to the change."
> >>>
> >>> How much proving/testing would you want to see before documenting this and supporting someone in future who finds a way to prove the claim wrong?
> >>
> >> *very* good question :)
> >>
> >> For testing, an xfstest will do - you can fork one of the existing data-only tests as a template.
> >
> > Due to the extended delay in a substantive response, I just wanted to send a quick thank you for your reply and suggestions here. I am still interested in pursuing this, but I have been busy and then recovering from illness.
> >
> > I'll need to study how xfstest directly exercises overlayfs and how it is combined with unionmount-testsuite, I think.
> >
> > Running unionmount-testsuite from fstests is optional, not a must, for developing an fstest.
> >
> > See README.overlay in fstests for a quick start with testing overlays.
> >
> > Thanks, Amir.
> >
> >> For documentation, I think it is too hard to commit to the general statement above.
> >>
> >> Try to narrow the exception to the rule to the very specific use case of an "append-only" instead of an "immutable" lower directory, and then state that the behavior is "defined" - the new entries are either visible to overlayfs or they are not visible, and the "undefined" element is *when* they become visible and via which API (*).
> >>
> >> (*) New entries may be visible to lookup and invisible to readdir due to the overlayfs readdir cache, and entries could be visible to readdir and invisible to lookup due to the vfs negative lookup cache.
>
> So I've gotten a test going that focuses on really just two behaviors that would satisfy my use case and that seem to currently be true. Tightening the claims to a few narrow -- and hopefully thus needing little to no effort to support -- statements seems like a good idea to me, though in thinking through my use case, the behaviors I attempt to make defined are a little different from how I read the idea above. That seems to be inclusive of regular lower layers, where files might or might not be accessible through the regular merge. It looks like your finalize patch is more oriented towards establishing useful defined behaviors in case of modifications to regular lower layers, as well as general performance. I thought I could probably go even simpler.
>
> Because I simply want to add new software versions to the big underlying data-only filesystem periodically, but am happy to create new overlayfs mounts complete with new "middle"/"redirect" layers to the new versions, I just focus on establishing the safety of append-only additions to a data-only layer that's part of a mounted overlayfs. The only real things I need defined are that appending a file to the data-only layer does not create undefined behavior in the existing overlayfs, and that the newly appended file is fully accessible for iteration and lookup in a new overlayfs, regardless of the file access patterns through any overlayfs that uses the data-only filesystem as a data-only layer.
>
> The defined behaviors are:
> * A file added to a data-only layer while mounted will not appear in the overlayfs via readdir or lookup, but it is safe for applications to attempt to do so.
> * A subsequently mounted overlayfs that includes redirects to the added files will be able to iterate and open the added files.
>
> So the test is my attempt to create the least favorable conditions / most likely conditions to break the defined behaviors. Of course, testing for "lack of undefined" behavior is open-ended in some sense. The test conforms to the tightly defined write patterns, but since we don't restrict the read patterns against overlayfs, there might be other interesting cases to validate there.

This feels like a good practical approach. As I wrote in a comment on your test patch, this is how all data-only overlayfs setups work, because the data-only layer is always going to be a layer that is shared among many overlays, so at any given time there will be an online overlayfs while blobs are added to the data-only layer to compose new images.
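To spell out the sequence we are defining - a rough sketch only, untested and with made-up paths; your xfstest is the authoritative version:

    data=/nfs/blobs            # shared, append-only data-only layer
    meta1=/local/meta.v1       # per-image metadata/redirect layer

    mount -t overlay overlay \
        -o metacopy=on,redirect_dir=on,lowerdir=$meta1::$data /mnt/image1

    ls -lR /mnt/image1         # access the mounted overlay, warm its caches
    cat /mnt/image1/bin/foo

    # append a new blob to the data-only layer while image1 is mounted;
    # image1 must not misbehave, it simply must not show the new blob
    cp /tmp/newblob $data/objects/ab/cd1234

    # compose a new metadata layer (stubs with metacopy + redirect xattrs,
    # as sketched earlier) that points at the new blob, and mount a second
    # overlay; the new file must be fully visible there, to both readdir
    # and open/lookup
    meta2=/local/meta.v2
    mount -t overlay overlay \
        -o metacopy=on,redirect_dir=on,lowerdir=$meta2::$data /mnt/image2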
It is good to make this behavior known and explicit - I am just saying that it is implied by the data-only layers feature, because it would have been useless otherwise.

I also think that this behavior hardly contradicts the documentation, because the documentation does not explicitly mention composing new layers offline, which is currently a gray area.

I think we could add an exception to the "Changes to underlying filesystems" section, regarding "Offline changes, when the overlay is not mounted", that explicitly allows appending files to a data-only layer, even with new features enabled.

Thanks,
Amir.