On 7/12/24 04:09, Amir Goldstein wrote:
> On Fri, Jul 12, 2024 at 6:24 AM Mike Baynton <mike@xxxxxxxxxxxx> wrote:
>>
>> On 7/11/24 18:30, Amir Goldstein wrote:
>>> On Thu, Jul 11, 2024 at 6:59 PM Daire Byrne <daire@xxxxxxxx> wrote:
>>>> Basically I have a read-only NFS filesystem with software releases
>>>> that are versioned such that no files are ever overwritten or changed.
>>>> New uniquely named directory trees and files are added from time to
>>>> time and older ones are cleaned up.
>>>>
>>>
>>> Sounds like a common use case that many people are interested in.
>>
>> I can vouch that that's accurate, I'm doing nearly the same thing. The
>> properties of the NFS filesystem in terms of what is and is not expected
>> to change is identical for me, though my approach to incorporating
>> overlayfs has been a little different.
>>
>> My confidence in the reliability of what I'm doing is still far from
>> absolute, so I will be interested in efforts to validate/officially
>> sanction/support/document related techniques.
>>
>> The way I am doing it is with NFS as a data-only layer. Basically my use
>> case calls for presenting different views of NFS-backed data (it's
>> software libraries) to different applications. No application wants or
>> needs to have the entire NFS tree exposed to it, but each application
>> wants to use some data available on NFS and wants it to be presented in
>> some particular local place. So I actually wanted a method where I
>> author a metadata-only layer external to overlayfs, built to spec.
>>
>> Essentially it's making overlayfs redirects be my symlinks so that code
>> which doesn't follow symlinks or is otherwise influenced by them is none
>> the wiser.
>>
>
> Nice.
> I've always wished that data-only would not be an "offline-only" feature,
> but getting the official API for that scheme right might be a challenge.
>
>>>> My first question is how bad can the "undefined behaviour" be in this
>>>> kind of setup?
>>>
>>> The behavior is "undefined" because nobody tried to define it,
>>> document it and test it.
>>> I don't think it would be that "bad", but it will be unpredictable
>>> and is not very nice for a software product.
>>>
>>> One of the current problems is that overlayfs uses a readdir cache,
>>> and the readdir cache is not auto-invalidated when the lower dir
>>> changes, so whether or not new subdirs are observed in overlay depends
>>> on whether the merged overlay directory is kept in cache or not.
>>>
>>
>> My approach doesn't support adding new files from the data-only NFS
>> layer after the overlayfs is created, of course, since the metadata-only
>> layer is itself the first lower layer and so would presumably get into
>> undefined-land if added to. But this arrangement does probably
>> mitigate this problem. Creating metadata inodes of a fixed set of
>> libraries for a specific application is cheap enough (and considerably
>> faster than copying it all locally) that the immutability limitation
>> works for me.
>>
>
> Assuming that this "effectively-data-only" NFS layer is never iterated via
> overlayfs then adding new unreferenced objects to this layer should not
> be a problem either.
>
>>>> Any files that get copied up to the upper layer are
>>>> guaranteed to never change in the lower NFS filesystem (by its
>>>> design), but new directories and files that have not yet been copied
>>>> up, can randomly appear over time. Deletions are not so important
>>>> because if it has been deleted in the lower level, then the upper
>>>> level copy failing has similar results (but we should clean up the
>>>> upper layer too).
>>>>
>>>> If it's possible to get over this first difficult hurdle, then I have
>>>> another extra bit of complexity to throw on top - now manually make an
>>>> entire directory tree (of metadata) that we have recursively copied up
>>>> "opaque" in the upper layer (currently needs to be done outside of
>>>> overlayfs). Over time or dropping of caches, I have found that this
>>>> (seamlessly?) takes effect for new lookups.
>>>>
>>>> I also noticed that in the current implementation, this "opaque"
>>>> transition actually breaks access to the file because the metadata
>>>> copy-up sets "trusted.overlay.metacopy" but does not currently add an
>>>> explicit "trusted.overlay.redirect" to the corresponding lower layer
>>>> file. But if it did (or we do it manually with setfattr), then it is
>>>> possible to have an upper level directory that is opaque, contains
>>>> file metadata only and redirects to the data to the real files on the
>>>> lower NFS filesystem.
>>
>> So once you use opaque dirs and redirects on an upper layer, it's
>> sounding very similar to redirects into a data-only layer. In either
>> case you're responsible for producing metadata inodes for each NFS file
>> you want presented to the application/user.
>>
>
> Yes, it is almost the same as data-only layer.
> The only difference is that real data-only layer can never be accessed
> directly from overlay, while the effectively-data-only layer must have
> some path (e.g /blobs) accessible directly from overlay in order to do
> online rename of blobs into the upper opaque layer.
>
>> This way seems interesting and more promising for adding NFS-backed
>> files "online" though.
>>
>>> how can we document it to make the behavior "defined"?
>>>
>>> My thinking is:
>>>
>>> "Changes to the underlying filesystems while part of a mounted overlay
>>> filesystem are not allowed. If the underlying filesystem is changed,
>>> the behavior of the overlay is undefined, though it will not result in
>>> a crash or deadlock.
>>>
>>> One exception to this rule is changes to underlying filesystem objects
>>> that were not accessed by a overlayfs prior to the change.
>>> In other words, once accessed from a mounted overlay filesystem,
>>> changes to the underlying filesystem objects are not allowed."
>>>
>>> But this claim needs to be proved and tested (write tests),
>>> before the documentation defines this behavior.
>>> I am not even sure if the claim is correct.
>>
>> I've been blissfully and naively assuming that it is based on intuition
>> :).
>
> Yes, what overlay did not observe, overlay cannot know about.
> But the devil is in the details, such as what is an "accessed
> filesystem object".
>
> In our case study, we refer to the newly added directory entries
> and new inodes "never accessed by overlayfs", so it sounds
> safe to add them while overlayfs is mounted, but their parent directory,
> even if never iterated via overlayfs was indeed accessed by overlayfs
> (when looking up for existing siblings), so overlayfs did access
> the lower parent directory and it does reference the lower parent
> directory dentry/inode, so it is still not "intuitively" safe to change it.
>
>>
>> I think Daire and I are basically only adding new files to the NFS
>> filesystem, and both the all-opaque approach and the data-only approach
>> could prevent accidental access to things on the NFS filesystem through
>> the overlayfs (or at least portion of it meant for end-user consumption)
>> while they are still being birthed and might be experiencing changes.
>> At some point in the NFS tree, directories must be modified, but since
>> both approaches have overlayfs sourcing all directory entries from local
>> metadata-only layers, it seems plausible that the directories that
>> change aren't really "accessed by a overlayfs prior to the change."
>>
>> How much proving/testing would you want to see before documenting this
>> and supporting someone in future who finds a way to prove the claim
>> wrong?
>>
>
> *very* good question :)
>
> For testing, an xfstest will do - you can fork one of the existing
> data-only tests as a template.

Given the long delay in getting to a substantive response, I just wanted
to send a quick thank you for your reply and suggestions here. I am still
interested in pursuing this, but I have been busy and then recovering
from illness. I think I'll need to study how xfstests directly exercises
overlayfs and how it is combined with unionmount-testsuite.

>
> For documentation, I think it is too hard to commit to the general
> statement above.
>
> Try to narrow the exception to the rule to the very specific use case
> of "append-only" instead of "immutable" lower directory and then
> state that the behavior is "defined" - the new entries are either visible
> by overlayfs or they are not visible, and the "undefined" element
> is *when* they become visible and via which API (*).
>
> (*) New entries may be visible to lookup and invisible to readdir
> due to overlayfs readdir cache, and entries could be visible to
> readdir and invisible to lookup, due to vfs negative lookup cache.
>
> Note that the behavior of POSIX readdir() for entries added while
> an open dir fd is being iterated is similar - the new entries will either
> be visible in the iteration of that fd or they won't be, but there is
> a clear "barrier" when the new entries will become visible
> (on seek to start or open of new fd)
>
>>>
>>> One more thing that could help said service is if overlayfs
>>> supported a hybrid mode of redirect_dir=follow,metacopy=on,
>>> where redirect is enabled for regular files for metacopy, but NOT
>>> enabled for directories (which was redirect_dir original use case).
>>>
>>> This way, the service could run the command line:
>>> $ mv /ovl/blah/thing /ovl/local
>>> then "mv" will get EXDEV for moving directories and will create
>>> opaque directories in their place and it will recursively move all
>>> the files to the opaque directories.
>>
>> Clever.
>
> Feel free to post this patch if you find it useful.
> The commit message should say that the mount option
> check does not reflect the actual dependency in the code,
> and it should also explain very well why this mount option combination
> is desired and include a lore Link: to this conversation.
>
> Thanks,
> Amir.
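
For anyone following along, here is a rough sketch of the "metadata-only
layer built to spec" idea discussed above. It only plans the layer: it
computes which xattrs each metadata inode would carry and what the mount
options might look like, and touches no filesystem. The xattr names are the
ones quoted in this thread. Everything else is an assumption made for
illustration: the function names, the example paths, the empty metacopy
value, the absolute redirect path, and the exact option string.

```python
import posixpath

# xattr names as quoted in this thread.
METACOPY_XATTR = "trusted.overlay.metacopy"
REDIRECT_XATTR = "trusted.overlay.redirect"


def plan_metadata_layer(views):
    """Return {path in metadata layer: {xattr name: value}} (hypothetical helper).

    `views` maps the path an application should see (relative to the
    overlay root) to the path of the real file relative to the root of
    the data-only NFS layer.
    """
    plan = {}
    for presented, blob in sorted(views.items()):
        # Assumption: a redirect into a data-only layer is an absolute
        # path measured from the root of that layer.
        redirect = posixpath.normpath(posixpath.join("/", blob))
        plan[presented.lstrip("/")] = {
            # Assumption: an empty value is enough to mark a metacopy inode.
            METACOPY_XATTR: b"",
            REDIRECT_XATTR: redirect.encode(),
        }
    return plan


def overlay_mount_options(metadata_layer, data_only_layer):
    """Build an option string for a read-only overlay (hypothetical helper).

    Assumption: the data-only layer follows a "::" separator and the mount
    uses metacopy=on with redirect_dir=follow.
    """
    return (
        f"lowerdir={metadata_layer}::{data_only_layer},"
        "metacopy=on,redirect_dir=follow"
    )


if __name__ == "__main__":
    # Hypothetical example: one library presented under /opt/app/lib.
    views = {"/opt/app/lib/libfoo.so": "releases/libfoo/1.2/libfoo.so"}
    for path, xattrs in plan_metadata_layer(views).items():
        print(path, xattrs)
    print(overlay_mount_options("/meta", "/mnt/nfs"))
```

Actually applying such a plan would mean creating each file in the metadata
layer with the right ownership, mode and size, then setting the xattrs with
setfattr or similar as root. None of that is shown here, and none of it has
been validated against the behavior discussed in this thread.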