On Mon, Jul 29, 2024 at 9:29 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Sun, Jul 28, 2024 at 11:33 PM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
> >
> > On 7/21/24 22:31, Amir Goldstein wrote:
> > >
> > > On Mon, Jul 22, 2024, 6:02 AM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
> > >
> > > On 7/12/24 04:09, Amir Goldstein wrote:
> > >> On Fri, Jul 12, 2024 at 6:24 AM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
> > >>>
> > >>> On 7/11/24 18:30, Amir Goldstein wrote:
> > >>>> On Thu, Jul 11, 2024 at 6:59 PM Daire Byrne <daire@xxxxxxxx> wrote:
> > >>>>> Basically I have a read-only NFS filesystem with software
> > >>>>> releases that are versioned such that no files are ever
> > >>>>> overwritten or changed. New uniquely named directory trees and
> > >>>>> files are added from time to time and older ones are cleaned up.
> > >>>>>
> > >>>>
> > >>>> Sounds like a common use case that many people are interested in.
> > >>>
> > >>> I can vouch that that's accurate; I'm doing nearly the same thing.
> > >>> The properties of the NFS filesystem, in terms of what is and is
> > >>> not expected to change, are identical for me, though my approach
> > >>> to incorporating overlayfs has been a little different.
> > >>>
> > >>> My confidence in the reliability of what I'm doing is still far
> > >>> from absolute, so I will be interested in efforts to
> > >>> validate/officially sanction/support/document related techniques.
> > >>>
> > >>> The way I am doing it is with NFS as a data-only layer. Basically
> > >>> my use case calls for presenting different views of NFS-backed
> > >>> data (it's software libraries) to different applications. No
> > >>> application wants or needs to have the entire NFS tree exposed to
> > >>> it, but each application wants to use some data available on NFS
> > >>> and wants it to be presented in some particular local place. So I
> > >>> actually wanted a method where I author a metadata-only layer
> > >>> external to overlayfs, built to spec.
> > >>>
> > >>> Essentially it's making overlayfs redirects be my symlinks, so
> > >>> that code which doesn't follow symlinks or is otherwise influenced
> > >>> by them is none the wiser.
> > >>>
> > >>
> > >> Nice. I've always wished that data-only would not be an
> > >> "offline-only" feature, but getting the official API for that
> > >> scheme right might be a challenge.
> > >>
> > >>>>> My first question is how bad can the "undefined behaviour" be
> > >>>>> in this kind of setup?
> > >>>>
> > >>>> The behavior is "undefined" because nobody tried to define it,
> > >>>> document it and test it. I don't think it would be that "bad",
> > >>>> but it will be unpredictable and is not very nice for a software
> > >>>> product.
> > >>>>
> > >>>> One of the current problems is that overlayfs uses a readdir
> > >>>> cache. The readdir cache is not auto-invalidated when a lower dir
> > >>>> changes, so whether or not new subdirs are observed in the
> > >>>> overlay depends on whether the merged overlay directory is kept
> > >>>> in cache or not.
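
To illustrate the cache dependence described above, a rough sketch (all
paths are made up, and the point is precisely that the outcome of the
second ls is not predictable):

  # overlay with an NFS export as the lower layer; paths are hypothetical
  mount -t overlay overlay \
      -o lowerdir=/mnt/nfs/releases,upperdir=/upper,workdir=/work /merged

  ls /merged/v1.2                          # populates the overlay readdir cache
  # a new file appears in the lower dir behind overlayfs' back
  # (written on the NFS server in the real setup):
  touch /mnt/nfs/releases/v1.2/new-file
  ls /merged/v1.2                          # may or may not list new-file, cache dependent
  echo 3 > /proc/sys/vm/drop_caches        # evict dentries/inodes, and with them the readdir cache
  ls /merged/v1.2                          # new-file should normally show up now
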
> > >>>
> > >>> My approach doesn't support adding new files from the data-only
> > >>> NFS layer after the overlayfs is created, of course, since the
> > >>> metadata-only layer is itself the first lower layer and so would
> > >>> presumably get into undefined-land if added to. But this
> > >>> arrangement probably does mitigate this problem. Creating
> > >>> metadata inodes for a fixed set of libraries for a specific
> > >>> application is cheap enough (and considerably faster than copying
> > >>> it all locally) that the immutability limitation works for me.
> > >>>
> > >>
> > >> Assuming that this "effectively-data-only" NFS layer is never
> > >> iterated via overlayfs, then adding new unreferenced objects to
> > >> this layer should not be a problem either.
> > >>
> > >>>>> Any files that get copied up to the upper layer are guaranteed
> > >>>>> to never change in the lower NFS filesystem (by its design),
> > >>>>> but new directories and files that have not yet been copied up
> > >>>>> can randomly appear over time. Deletions are not so important,
> > >>>>> because if it has been deleted in the lower level, then the
> > >>>>> upper level copy failing has similar results (but we should
> > >>>>> clean up the upper layer too).
> > >>>>>
> > >>>>> If it's possible to get over this first difficult hurdle, then
> > >>>>> I have another extra bit of complexity to throw on top - now
> > >>>>> manually make an entire directory tree (of metadata) that we
> > >>>>> have recursively copied up "opaque" in the upper layer
> > >>>>> (currently needs to be done outside of overlayfs). Over time,
> > >>>>> or on dropping of caches, I have found that this (seamlessly?)
> > >>>>> takes effect for new lookups.
> > >>>>>
> > >>>>> I also noticed that in the current implementation, this
> > >>>>> "opaque" transition actually breaks access to the file, because
> > >>>>> the metadata copy-up sets "trusted.overlay.metacopy" but does
> > >>>>> not currently add an explicit "trusted.overlay.redirect" to the
> > >>>>> corresponding lower layer file. But if it did (or we do it
> > >>>>> manually with setfattr), then it is possible to have an upper
> > >>>>> level directory that is opaque, contains file metadata only,
> > >>>>> and redirects for the data to the real files on the lower NFS
> > >>>>> filesystem.
> > >>>
> > >>> So once you use opaque dirs and redirects on an upper layer, it's
> > >>> sounding very similar to redirects into a data-only layer. In
> > >>> either case you're responsible for producing metadata inodes for
> > >>> each NFS file you want presented to the application/user.
> > >>>
> > >>
> > >> Yes, it is almost the same as a data-only layer. The only
> > >> difference is that a real data-only layer can never be accessed
> > >> directly from the overlay, while the effectively-data-only layer
> > >> must have some path (e.g. /blobs) accessible directly from the
> > >> overlay in order to do online rename of blobs into the upper
> > >> opaque layer.
> > >>
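
For concreteness, hand-building one such metadata-only entry looks
roughly like the sketch below. All paths and names are invented, and the
exact requirements (kernel version, additional mount options such as
redirect_dir, and so on) are glossed over; treat it as an illustration of
the scheme discussed here, not a recipe:

  # /meta is the hand-built metadata-only lower layer, /mnt/nfs the
  # (effectively) data-only NFS layer
  size=$(stat -c %s /mnt/nfs/releases/v1.2/libfoo.so)
  mkdir -p /meta/app
  truncate -s "$size" /meta/app/libfoo.so        # placeholder inode, no data
  # mode/owner of the placeholder are what the overlay will show;
  # the data comes from the redirect target
  setfattr -n trusted.overlay.metacopy /meta/app/libfoo.so
  # an absolute redirect is resolved against the lower (data) layer roots
  setfattr -n trusted.overlay.redirect -v "/releases/v1.2/libfoo.so" /meta/app/libfoo.so

  # '::' marks /mnt/nfs as a data-only layer, reachable only via such
  # redirects; metacopy=on is needed for the redirects to be followed
  mount -t overlay overlay -o metacopy=on,lowerdir=/meta::/mnt/nfs,upperdir=/upper,workdir=/work /merged
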
> > >>> This way seems interesting and more promising for adding
> > >>> NFS-backed files "online" though.
> > >>>
> > >>>> how can we document it to make the behavior "defined"?
> > >>>>
> > >>>> My thinking is:
> > >>>>
> > >>>> "Changes to the underlying filesystems while part of a mounted
> > >>>> overlay filesystem are not allowed. If the underlying filesystem
> > >>>> is changed, the behavior of the overlay is undefined, though it
> > >>>> will not result in a crash or deadlock.
> > >>>>
> > >>>> One exception to this rule is changes to underlying filesystem
> > >>>> objects that were not accessed by an overlayfs prior to the
> > >>>> change. In other words, once accessed from a mounted overlay
> > >>>> filesystem, changes to the underlying filesystem objects are not
> > >>>> allowed."
> > >>>>
> > >>>> But this claim needs to be proved and tested (write tests),
> > >>>> before the documentation defines this behavior. I am not even
> > >>>> sure if the claim is correct.
> > >>>
> > >>> I've been blissfully and naively assuming that it is, based on
> > >>> intuition :).
> > >>
> > >> Yes, what overlay did not observe, overlay cannot know about. But
> > >> the devil is in the details, such as what is an "accessed
> > >> filesystem object".
> > >>
> > >> In our case study, we refer to the newly added directory entries
> > >> and new inodes as "never accessed by overlayfs", so it sounds safe
> > >> to add them while overlayfs is mounted, but their parent directory,
> > >> even if never iterated via overlayfs, was indeed accessed by
> > >> overlayfs (when looking up existing siblings), so overlayfs did
> > >> access the lower parent directory and it does reference the lower
> > >> parent directory dentry/inode, so it is still not "intuitively"
> > >> safe to change it.
> >
> > This makes sense. I've been sure to cause the directory in the
> > data-only layer that subsequently experiences an "append" to be
> > consulted to look up a different file before the append.
> >
> > >>>
> > >>> I think Daire and I are basically only adding new files to the
> > >>> NFS filesystem, and both the all-opaque approach and the
> > >>> data-only approach could prevent accidental access to things on
> > >>> the NFS filesystem through the overlayfs (or at least the portion
> > >>> of it meant for end-user consumption) while they are still being
> > >>> birthed and might be experiencing changes. At some point in the
> > >>> NFS tree, directories must be modified, but since both approaches
> > >>> have overlayfs sourcing all directory entries from local
> > >>> metadata-only layers, it seems plausible that the directories
> > >>> that change aren't really "accessed by an overlayfs prior to the
> > >>> change."
> > >>>
> > >>> How much proving/testing would you want to see before documenting
> > >>> this and supporting someone in future who finds a way to prove
> > >>> the claim wrong?
> > >>>
> > >>
> > >> *very* good question :)
> > >>
> > >> For testing, an xfstest will do - you can fork one of the existing
> > >> data-only tests as a template.
> > >
> > > Due to the extended delay in a substantive response, I just wanted
> > > to send a quick thank you for your reply and suggestions here. I am
> > > still interested in pursuing this, but I have been busy and then
> > > recovering from illness.
> > >
> > > I'll need to study how xfstests directly exercises overlayfs and
> > > how it is combined with unionmount-testsuite, I think.
> > >
> > > Running unionmount-testsuite from fstests is optional, not a must,
> > > for developing an fstest.
> > >
> > > See README.overlay in fstests for a quick start with testing
> > > overlays.
> > >
> > > Thanks, Amir.
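
For reference, the README.overlay quick start amounts to roughly the
following; device names and mount points are placeholders, and details
belong to the README itself:

  # local.config: a base filesystem that will host the overlay layers
  export FSTYP=xfs
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV=/dev/vdc
  export SCRATCH_MNT=/mnt/scratch

  # run the suite with overlay mounted on top of the base fs,
  # or a single test, e.g.:
  ./check -overlay
  ./check -overlay overlay/001
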
> > >> For documentation, I think it is too hard to commit to the
> > >> general statement above.
> > >>
> > >> Try to narrow the exception to the rule to the very specific use
> > >> case of an "append-only" instead of an "immutable" lower directory
> > >> and then state that the behavior is "defined" - the new entries
> > >> are either visible to overlayfs or they are not visible, and the
> > >> "undefined" element is *when* they become visible and via which
> > >> API (*).
> > >>
> > >> (*) New entries may be visible to lookup and invisible to readdir
> > >> due to the overlayfs readdir cache, and entries could be visible
> > >> to readdir and invisible to lookup, due to the vfs negative lookup
> > >> cache.
> >
> > So I've gotten a test going that focuses on really just two behaviors
> > that would satisfy my use case and that seem to currently be true.
> > Tightening the claims to a few narrow -- and hopefully thus needing
> > little to no effort to support -- statements seems like a good idea
> > to me, though in thinking through my use case, the behaviors I
> > attempt to make defined are a little different from how I read the
> > idea above. That seems to be inclusive of regular lower layers, where
> > files might or might not be accessible through a regular merge. It
> > looks like your finalize patch is more oriented towards establishing
> > useful defined behaviors in case of modifications to regular lower
> > layers, as well as general performance. I thought I could probably go
> > even simpler.
> >
> > Because I simply want to add new software versions to the big
> > underlying data-only filesystem periodically, but am happy to create
> > new overlayfs mounts complete with new "middle"/"redirect" layers to
> > the new versions, I just focus on establishing the safety of
> > append-only additions to a data-only layer that's part of a mounted
> > overlayfs. The only real things I need defined are that appending a
> > file to the data-only layer does not create undefined behavior in the
> > existing overlayfs, and that the newly appended file is fully
> > accessible for iteration and lookup in a new overlayfs, regardless of
> > the file access patterns through any overlayfs that uses the
> > data-only filesystem as a data-only layer.
> >
> > The defined behaviors are:
> > * A file added to a data-only layer while mounted will not appear in
> >   the overlayfs via readdir or lookup, but it is safe for
> >   applications to attempt to do so.
> > * A subsequently mounted overlayfs that includes redirects to the
> >   added files will be able to iterate and open the added files.
> >
> > So the test is my attempt to create the least favorable conditions /
> > most likely conditions to break the defined behaviors. Of course,
> > testing for "lack of undefined behavior" is open-ended in some sense.
> > The test conforms to the tightly defined write patterns, but since we
> > don't restrict the read patterns against overlayfs, there might be
> > other interesting cases to validate there.
>
> This feels like a good practical approach.
>
> As I wrote in a comment on your test patch, this is how all data-only
> overlayfs usage works, because the data-only layer is always going to
> be a layer that is shared among many overlayfs, so at any given time
> there would be an online overlayfs when blobs are added to the
> data-only layer to compose new images.
>
> It is good to make this behavior known and explicit - I am just saying
> that it is implied by the data-only layers feature, because it would
> have been useless otherwise.
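
Sketched as a sequence, the two defined behaviors above look roughly like
this (reusing the hand-built metadata layer idea from the earlier sketch;
all paths are invented):

  # overlay A is already mounted with /mnt/nfs as its data-only layer
  # (1) a blob is appended on the NFS server:
  #     /mnt/nfs/releases/v1.3/libbar.so appears in the lower fs
  ls /merged-a/app/              # the new blob is not referenced, so it never shows up
  stat /merged-a/app/libbar.so   # safe to try; just ENOENT, no undefined behavior expected

  # (2) build a new metadata layer /meta-b (as in the earlier sketch) whose
  #     redirect points at /releases/v1.3/libbar.so, then:
  mount -t overlay overlay -o metacopy=on,lowerdir=/meta-b::/mnt/nfs,upperdir=/upper-b,workdir=/work-b /merged-b
  ls /merged-b/app/              # libbar.so is listed (entries come from /meta-b)
  cat /merged-b/app/libbar.so    # and its data is read from the NFS data-only layer
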
I agree that it is nice to have this be explicit; clearly e.g. composefs
(at least the expected use case of it) would need this. I never even
considered that this would not be the case though, as why would separate
mounts affect each other.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@xxxxxxxxxx         alexander.larsson@xxxxxxxxx