On Sun, Jul 28, 2024 at 11:33 PM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
>
> On 7/21/24 22:31, Amir Goldstein wrote:
> >
> > On Mon, Jul 22, 2024, 6:02 AM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
> >
> > On 7/12/24 04:09, Amir Goldstein wrote:
> >> On Fri, Jul 12, 2024 at 6:24 AM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
> >>>
> >>> On 7/11/24 18:30, Amir Goldstein wrote:
> >>>> On Thu, Jul 11, 2024 at 6:59 PM Daire Byrne <daire@xxxxxxxx> wrote:
> >>>>> Basically I have a read-only NFS filesystem with software releases that are versioned such that no files are ever overwritten or changed. New uniquely named directory trees and files are added from time to time and older ones are cleaned up.
> >>>>
> >>>> Sounds like a common use case that many people are interested in.
> >>>
> >>> I can vouch that that's accurate, I'm doing nearly the same thing. The properties of the NFS filesystem, in terms of what is and is not expected to change, are identical for me, though my approach to incorporating overlayfs has been a little different.
> >>>
> >>> My confidence in the reliability of what I'm doing is still far from absolute, so I will be interested in efforts to validate/officially sanction/support/document related techniques.
> >>>
> >>> The way I am doing it is with NFS as a data-only layer. Basically my use case calls for presenting different views of NFS-backed data (it's software libraries) to different applications. No application wants or needs to have the entire NFS tree exposed to it, but each application wants to use some data available on NFS and wants it to be presented in some particular local place. So I actually wanted a method where I author a metadata-only layer external to overlayfs, built to spec.
> >>>
> >>> Essentially it's making overlayfs redirects be my symlinks, so that code which doesn't follow symlinks or is otherwise influenced by them is none the wiser.
> >>
> >> Nice. I've always wished that data-only would not be an "offline-only" feature, but getting the official API for that scheme right might be a challenge.
> >>
> >>>>> My first question is how bad can the "undefined behaviour" be in this kind of setup?
> >>>>
> >>>> The behavior is "undefined" because nobody tried to define it, document it and test it. I don't think it would be that "bad", but it will be unpredictable and is not very nice for a software product.
> >>>>
> >>>> One of the current problems is that overlayfs uses a readdir cache; the readdir cache is not auto-invalidated when a lower dir changes, so whether or not new subdirs are observed in the overlay depends on whether the merged overlay directory is kept in cache or not.
> >>>
> >>> My approach doesn't support adding new files from the data-only NFS layer after the overlayfs is created, of course, since the metadata-only layer is itself the first lower layer and so would presumably get into undefined-land if added to. But this arrangement does probably mitigate this problem. Creating metadata inodes of a fixed set of libraries for a specific application is cheap enough (and considerably faster than copying it all locally) that the immutability limitation works for me.
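Spelling the scheme out for the archive: below is a rough, untested sketch of the kind of layout you are describing. All paths, names and sizes are made up, and the exact option spellings depend on kernel version; the point is only that the locally authored layer carries metadata stubs whose redirect xattrs point at the real files in the data-only NFS layer.

    nfs=/mnt/nfs/releases        # big read-only NFS tree, used as data-only layer
    meta=/srv/app1/meta          # small local metadata-only layer, built to spec

    mkdir -p $meta/lib
    # stub inode: same size/mode as the real file, but no data
    truncate -s 123456 $meta/lib/libfoo.so.1
    chmod 755 $meta/lib/libfoo.so.1
    # mark the stub as metadata-only (zero-length value, IIRC) and
    # redirect it to the real file's path inside the data-only layer
    setfattr -n trusted.overlay.metacopy $meta/lib/libfoo.so.1
    setfattr -n trusted.overlay.redirect -v "/v1.2.3/lib/libfoo.so.1" \
        $meta/lib/libfoo.so.1

    # '::' separates data-only layers; data-only layers require metacopy
    mount -t overlay overlay \
        -o metacopy=on,redirect_dir=on,lowerdir=$meta::$nfs /srv/app1/view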
> >> Assuming that this "effectively-data-only" NFS layer is never iterated via overlayfs, then adding new unreferenced objects to this layer should not be a problem either.
> >>
> >>>>> Any files that get copied up to the upper layer are guaranteed to never change in the lower NFS filesystem (by its design), but new directories and files that have not yet been copied up can randomly appear over time. Deletions are not so important because if it has been deleted in the lower level, then the upper level copy failing has similar results (but we should clean up the upper layer too).
> >>>>>
> >>>>> If it's possible to get over this first difficult hurdle, then I have another extra bit of complexity to throw on top - now manually make an entire directory tree (of metadata) that we have recursively copied up "opaque" in the upper layer (currently needs to be done outside of overlayfs). Over time or with dropping of caches, I have found that this (seamlessly?) takes effect for new lookups.
> >>>>>
> >>>>> I also noticed that in the current implementation, this "opaque" transition actually breaks access to the file because the metadata copy-up sets "trusted.overlay.metacopy" but does not currently add an explicit "trusted.overlay.redirect" to the corresponding lower layer file. But if it did (or we do it manually with setfattr), then it is possible to have an upper level directory that is opaque, contains file metadata only and redirects for the data to the real files on the lower NFS filesystem.
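For the archive, the manual steps Daire describes would look roughly like this - an illustrative, untested sketch with made-up paths, done offline (overlay unmounted) and needing root for the trusted.* namespace:

    upper=/local/upper    # upper layer of the overlay in question

    # the tree was already metadata-only copied up through the overlay
    # (metacopy=on), so its files carry trusted.overlay.metacopy

    # make the whole copied-up tree opaque, so lookup and readdir stop
    # falling through to the (changing) lower NFS directories
    find $upper/releases/v1.2.3 -type d \
        -exec setfattr -n trusted.overlay.opaque -v y {} +

    # add the redirect that metadata copy-up does not add by itself today,
    # so the now-opaque metacopy files can still reach their lower data
    # (one redirect per regular file; a single file shown here)
    setfattr -n trusted.overlay.redirect -v "/releases/v1.2.3/libfoo.so.1" \
        $upper/releases/v1.2.3/libfoo.so.1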
> >>> So once you use opaque dirs and redirects on an upper layer, it's sounding very similar to redirects into a data-only layer. In either case you're responsible for producing metadata inodes for each NFS file you want presented to the application/user.
> >>
> >> Yes, it is almost the same as a data-only layer. The only difference is that a real data-only layer can never be accessed directly from the overlay, while the effectively-data-only layer must have some path (e.g. /blobs) accessible directly from the overlay in order to do online rename of blobs into the upper opaque layer.
> >>
> >>> This way seems interesting and more promising for adding NFS-backed files "online" though.
> >>>
> >>>> how can we document it to make the behavior "defined"?
> >>>>
> >>>> My thinking is:
> >>>>
> >>>> "Changes to the underlying filesystems while part of a mounted overlay filesystem are not allowed. If the underlying filesystem is changed, the behavior of the overlay is undefined, though it will not result in a crash or deadlock.
> >>>>
> >>>> One exception to this rule is changes to underlying filesystem objects that were not accessed by an overlayfs prior to the change. In other words, once accessed from a mounted overlay filesystem, changes to the underlying filesystem objects are not allowed."
> >>>>
> >>>> But this claim needs to be proved and tested (write tests) before the documentation defines this behavior. I am not even sure if the claim is correct.
> >>>
> >>> I've been blissfully and naively assuming that it is based on intuition :).
> >>
> >> Yes, what overlay did not observe, overlay cannot know about. But the devil is in the details, such as what is an "accessed filesystem object".
> >>
> >> In our case study, we refer to the newly added directory entries and new inodes as "never accessed by overlayfs", so it sounds safe to add them while overlayfs is mounted. But their parent directory, even if never iterated via overlayfs, was indeed accessed by overlayfs (when looking up existing siblings), so overlayfs did access the lower parent directory and it does reference the lower parent directory dentry/inode, so it is still not "intuitively" safe to change it.
>
> This makes sense. I've been sure to cause the directory in the data-only layer that subsequently experiences an "append" to be consulted to look up a different file before the append.
>
> >>> I think Daire and I are basically only adding new files to the NFS filesystem, and both the all-opaque approach and the data-only approach could prevent accidental access to things on the NFS filesystem through the overlayfs (or at least the portion of it meant for end-user consumption) while they are still being birthed and might be experiencing changes. At some point in the NFS tree, directories must be modified, but since both approaches have overlayfs sourcing all directory entries from local metadata-only layers, it seems plausible that the directories that change aren't really "accessed by an overlayfs prior to the change."
> >>>
> >>> How much proving/testing would you want to see before documenting this and supporting someone in future who finds a way to prove the claim wrong?
> >>
> >> *very* good question :)
> >>
> >> For testing, an xfstest will do - you can fork one of the existing data-only tests as a template.
> >
> > Due to the extended delay in a substantive response, I just wanted to send a quick thank you for your reply and suggestions here. I am still interested in pursuing this, but I have been busy and then recovering from illness.
> >
> > I'll need to study how xfstest directly exercises overlayfs and how it is combined with unionmount-testsuite, I think.
> >
> > Running unionmount-testsuite from fstests is optional, not a must, for developing an fstest.
> >
> > See README.overlay in fstests for a quick start with testing overlays.
> >
> > Thanks, Amir.
> >
> >> For documentation, I think it is too hard to commit to the general statement above.
> >>
> >> Try to narrow the exception to the rule to the very specific use case of an "append-only" instead of an "immutable" lower directory, and then state that the behavior is "defined" - the new entries are either visible to overlayfs or they are not visible, and the "undefined" element is *when* they become visible and via which API (*).
> >>
> >> (*) New entries may be visible to lookup and invisible to readdir due to the overlayfs readdir cache, and entries could be visible to readdir and invisible to lookup due to the vfs negative lookup cache.
>
> So I've gotten a test going that focuses on really just two behaviors that would satisfy my use case and that seem to currently be true. Tightening the claims to a few narrow -- and hopefully thus needing little to no effort to support -- statements seems like a good idea to me, though in thinking through my use case, the behaviors I attempt to make defined are a little different from how I read the idea above. That seems to be inclusive of regular lower layers, where files might or might not be accessible through the regular merge. It looks like your finalize patch is more oriented towards establishing useful defined behaviors in case of modifications to regular lower layers, as well as general performance. I thought I could probably go even simpler.
>
> Because I simply want to add new software versions to the big underlying data-only filesystem periodically, but am happy to create new overlayfs mounts complete with new "middle"/"redirect" layers to the new versions, I just focus on establishing the safety of append-only additions to a data-only layer that's part of a mounted overlayfs. The only real things I need defined are that appending a file to the data-only layer does not create undefined behavior in the existing overlayfs, and that the newly appended file is fully accessible for iteration and lookup in a new overlayfs, regardless of the file access patterns through any overlayfs that uses the data-only filesystem as a data-only layer.
>
> The defined behaviors are:
> * A file added to a data-only layer while mounted will not appear in the overlayfs via readdir or lookup, but it is safe for applications to attempt to do so.
> * A subsequently mounted overlayfs that includes redirects to the added files will be able to iterate and open the added files.
>
> So the test is my attempt to create the least favorable conditions / most likely conditions to break the defined behaviors. Of course, testing for "lack of undefined" behavior is open-ended in some sense. The test conforms to the tightly defined write patterns, but since we don't restrict the read patterns against overlayfs, there might be other interesting cases to validate there.

This feels like a good practical approach. As I wrote in a comment on your test patch, this is how all data-only overlayfs setups work, because the data-only layer is always going to be a layer that is shared among many overlays, so at any given time there will be an online overlayfs while blobs are added to the data-only layer to compose new images.
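To spell out the sequence we are defining - a rough sketch only, untested and with made-up paths; your xfstest is the authoritative version:

    data=/nfs/blobs            # shared, append-only data-only layer
    meta1=/local/meta.v1       # per-image metadata/redirect layer

    mount -t overlay overlay \
        -o metacopy=on,redirect_dir=on,lowerdir=$meta1::$data /mnt/image1

    ls -lR /mnt/image1         # access the mounted overlay, warm its caches
    cat /mnt/image1/bin/foo

    # append a new blob to the data-only layer while image1 is mounted;
    # image1 must not misbehave, it simply must not show the new blob
    cp /tmp/newblob $data/objects/ab/cd1234

    # compose a new metadata layer (stubs with metacopy + redirect xattrs,
    # as sketched earlier) that points at the new blob, and mount a second
    # overlay; the new file must be fully visible there, to both readdir
    # and open/lookup
    meta2=/local/meta.v2
    mount -t overlay overlay \
        -o metacopy=on,redirect_dir=on,lowerdir=$meta2::$data /mnt/image2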
It is good to make this behavior known and explicit - I am just saying that it is implied by the data-only layers feature, because it would have been useless otherwise.

I also think that this behavior hardly contradicts the documentation, because the documentation does not explicitly mention composing new layers offline, which is currently a gray area.

I think we could add an exception to the "Changes to underlying filesystems" section, regarding "Offline changes, when the overlay is not mounted", that explicitly allows appending files to a data-only layer, even with new features enabled.

Thanks,
Amir.