On Wed, 17 Jul 2024 at 19:15, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Mon, Jul 15, 2024 at 9:14 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> >
> >
> > On Mon, Jul 15, 2024, 6:36 PM Daire Byrne <daire@xxxxxxxx> wrote:
> >>
> >> On Mon, 15 Jul 2024 at 15:15, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >> >
> >> > > > I understand.
> >> > > > It makes sense.
> >> > > >
> >> > > > I remember tossing the idea of "finalizing" the merged dir copy up -
> >> > > > meaning that at the end of ovl_dir_read_merged(), overlayfs knows
> >> > > > if the upper entries shadow all the lower entries, and in this case, the
> >> > > > lower layers NEVER need to be iterated again, so some xattr could
> >> > > > be set on the upper dir to indicate that the copy up on the dir content
> >> > > > has been completed.
> >> > > >
> >> > > > After the copy up of dir content has been completed, then ovl_lookup()
> >> > > > should not continue to lookup children of this merged dir in lower layers
> >> > > > unless it was redirected by upper layer.
> >> > > >
> >> > > > It is not a trivial change, but I think it can be beneficial.
> >> > > >
> >> > > > The good thing about this is that there is no need for a new API -
> >> > > > all your service would need to do is chown -R as you tried to do and
> >> > > > it will "just work" - no more unneeded lookups in NFS layer.
> >> > > >
> >> > > Well, that is an interesting idea. I'm not sure how you would
> >> > > determine that a merged dir has been "completely" copied up (comparing
> >> > > readdir results?).
> >> >
> >> > overlay readdir of merged dir NEEDS to merge lower entries
> >> > that DO NOT exist in the upper layer - if there are not such entries
> >> > found, looking in the lower layer next time is futile.
> >> >
> >> > > And how would this differ to setting the "opaque"
> >> > > xattr on the dir (but automatically)?
> >> >
> >> > The lower layer still has information that overlayfs needs,
> >> > and overlayfs needs to be able to follow redirects into lower layer.
> >> > This is not going to work with an opaque upper dir.
> >>
> >> I guess as long as the upperdir can now serve all the lookups and
> >> negative lookups for a given directory (and optionally entire
> >> subsequent directory tree) without needing to consult with the lower
> >> directory specifically for them, that's all I care about :)
> >>
> >> > > Would it need a new xattr?
> >> > >
> >> >
> >> > Maybe, or use the combination of "opaque" + "redirect" to
> >> > describe this hybrid type of directory (the dir content was fully
> >> > copied up, but redirects may still follow to lower entries).
> >> > Essentially, this is equivalent to a lower-most directory (implicitly
> >> > opaque dir) that can follow redirects into a data-only layer.
> >> >
> >> > > It also means that all subsequent dirs in the lower tree would also be
> >> > > "opaque" even if they have not been checked for copy-up completeness?
> >> >
> >> > No. A directory inode is a sort of a file whose "data" is the dir content.
> >> > "copy-up completeness" means the list of entries have been copied up
> >> > (not recursively).
> >> >
> >> > > Or they would get a redirect until it could be determined they were
> >> > > completely copied up?
> >> >
> >> > readdir operated on a single dir inode.
> >> > readdir of a directory can end up making it "half-opaque"
> >> > nothing recursive about it - application can do this recursively
> >> > as it wishes.
> >> >
> >> > >
> >> > > I also won't pretend to understand how you could do that for a
> >> > > recursive copy up without momentarily disrupting access. Like if you
> >> > > did a recursive copy up and the top level dirs complete first while
> >> > > the lower contents haven't been totally copied up yet?
> >> >
> >> > Not doing anything recursive.
> >>
> >> I guess what I meant by recursive was the proposed "chown -R" that
> >> would "promote" the metadata to the upper layer recursively.
> >>
> >> I think you answered my question by saying that both files &
> >> directories in a "complete" copy-up directory would still get a
> >> redirect so it wouldn't break access while the chown was running? Once
> >> it gets to the next level, the new xattr (or opaque + redirect) would
> >> then be added to those directories etc etc. all the way down.
> >
> >
> > Yap.
> >
> >>
> >> > >
> >> > > It sounds complex :)
> >> >
> >> > Not really. The patch is not trivial, but the concept is simple.
> >> > If I find a few hours, I will post a demo.
> >>
> >> That would be cool! Always happy to test patches.
> >>
> >> > > > > > One more thing that could help said service is if overlayfs
> >> > > > > > supported a hybrid mode of redirect_dir=follow,metacopy=on,
> >> > > > > > where redirect is enabled for regular files for metacopy, but NOT
> >> > > > > > enabled for directories (which was redirect_dir original use case).
> >> > > > > >
> >> > > > > > This way, the service could run the command line:
> >> > > > > > $ mv /ovl/blah/thing /ovl/local
> >> > > > > > then "mv" will get EXDEV for moving directories and will create
> >> > > > > > opaque directories in their place and it will recursively move all
> >> > > > > > the files to the opaque directories.
> >> > > > >
> >> > > > > Okay, I think I see what you are getting at but I need to test the
> >> > > > > patch to make sure :)
> >> > > >
> >> > > Sorry, I will try and test the patch this week as I am actually
> >> > > curious about using it to create offline handcrafted overlay trees
> >> > > too. So rather than run a combination of truncate, touch, chown,
> >> > > chmod, setfattr commands, mount an overlay with your patch, move the
> >> > > dirs around, umount and then use the resulting metadata overlay as a
> >> > > read-only overlay from then on.
> >> > >
> >> >
> >> > That sounds much better than mangling with overlayfs xattrs.
> >> >
> >> > > I'm still toying with the idea of creating one (enormous) read-only
> >> > > overlay with all the lib/plugin directories as opaque directories and
> >> > > just accepting that I might only refresh it once a day and clients
> >> > > might only remount it once a week... Not great, but some amount of
> >> > > local lookup acceleration is better than none.
> >> > >
> >> > > I think the main problem with using this patch for my use case is that
> >> > > as soon as you do the mv, you break any processes that might be
> >> > > scanning those dirs at that instant or any new ones that start up. It
> >> > > may be possible to have my userspace daemon choose the right time to
> >> > > run the mv, but it's hard to predict how fast it would take to
> >> > > complete.
> >> > >
> >> >
> >> > Confused. I thought you were going to use the patch for offline preparation
> >> > of metacopy layers.
> >>
> >> Sorry, I did mean only for the case where I might create the desired
> >> upper layer for reuse later on (ie offline changes), your patch sounds
> >> like a really useful and optimised time saver compared to my
> >> hand-crafted method. I am still considering the offline method if
> >> there proves to be no other alternative.
> >>
> >> But for the case where I would want a seamless online way to achieve
> >> the same upper layer opaque directories, then obviously moving
> >> directory trees even momentarily out of position and back again would
> >> likely break software just starting up in that moment.
> >>
> >> And coordinating a background daemon that does the mv, with users who
> >> randomly start applications sounds like a difficult problem.
> >>
> >> > Note that once you did mv into an opaque tree,
> >> > you can move the opaque dir back into its original location
> >> > (e.g. /blah/think/UUID...) and the dir will remain opaque,
> >> > because EXDEV is only generated when trying to move
> >> > merged dirs.
> >> > Moving opaque upper dirs around is allowed and should work.
> >>
> >> Yes exactly, this would likely work most of the time while online
> >> except when some software is expecting the files to always be located
> >> in an immutable path location and the mv is in progress? Unless I am
> >> totally misunderstanding (always a strong possibility).
> >
> >
> > You understood correctly.
> > This method is not suitable for online promotion.
> >
> >>
> >> Basically, I need to be able to continue serving the same files and
> >> paths even while the copy-up metadata process for any part of the tree
> >> is in progress. And it sounds like your idea of considering a copy-up
> >> of a merged dir as "complete" (and essentially opaque) would be the
> >> way to do that without files or dirs ever moving or losing access even
> >> momentarily.
> >
> >
> > Yes, that's the idea.
> >
> > I'll see when I get around to that demo.
>
> I found some time to write the POC patch, but not enough time
> to make it work :) - it is failing some fstests.
>
> Since I don't know when I will have time to debug the issues,
> here is the WIP if you want to debug it and point out the bugs:
>
> https://github.com/amir73il/linux/commits/ovl-finalize-dir/

This is very cool - many thanks! Unfortunately, I'm probably not the
right person to code and identify actual fixes, but I can test and
describe results pretty well. :)

So I applied the patch (cleanly) to v6.9.3 (because I had it handy)
and mounted with "metadata=on".

The first oddity is that the root ovl directory shows no results for
"ls /ovl" (there are lots of dirs in the lower layer) but if I do the
same to a directory I know exists, it appears and returns results just
fine (e.g. ls /ovl/thing/blah). Then if I "ls /ovl" again I see just
/ovl/thing but none of the other dirs (until also accessed by path).
Anyway, that doesn't really block further testing as the software I
load does not need to walk or interrogate the entries.

So then I did a "chown -h -R bob /blah/thing/stuff/version" and looked
at the xattrs of the upper - all the (metadata) files and dirs were
brought up with files having a redirect, but the dirs that should have
trusted.overlay.opaque=z did not at this stage. Another followup "ls
-lR /blah/thing/stuff/version" and now I can see the
trusted.overlay.opaque=z where I would expect it to be.

But now when I lookup random NOENT files in those directories, I can
still see the lookup going across the network to the lower filesystem?

It looks like it's the same for the positive lookups - doing a stat
against a file that I know is in a trusted.overlay.opaque=z directory
still sends the lookup over NFS (which it does not if the directory is
opaque=y).

I mean, I expect a lookup for an existing file with a metadata
redirect to it for reads but not metadata stat() lookups? Also I would
expect no lookups to the lower for negative lookups? Unless we can't
serve negative lookups from the readdir of the upper dir?

I have probably misunderstood that the "finalized" directories will
only serve the contents of the readdir result and not send metadata
lookups to the lower level (ala dir=opaque). Or my v6.9.3 kernel has
some other issue unrelated to this patch....

Daire
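
For anyone wanting to repeat the xattr check on the upper layer, it
could be scripted roughly as below. This is only a minimal sketch: the
function names (overlay_xattrs, classify, report) are made up for
illustration, the "z" value is simply what was observed with the WIP
branch in this thread and is not a documented overlayfs value, and
reading trusted.* xattrs needs root (CAP_SYS_ADMIN) on the upper
filesystem itself, not on the overlay mount.

```python
import os

OVL_PREFIX = "trusted.overlay."


def overlay_xattrs(path):
    """Return {short_name: bytes} for the trusted.overlay.* xattrs on path."""
    found = {}
    try:
        names = os.listxattr(path, follow_symlinks=False)
    except OSError:
        return found  # missing entry, or filesystem without xattr support
    for name in names:
        if not name.startswith(OVL_PREFIX):
            continue
        try:
            value = os.getxattr(path, name, follow_symlinks=False)
        except OSError:
            continue  # xattr vanished or is not readable; skip it
        found[name[len(OVL_PREFIX):]] = value
    return found


def classify(xattrs):
    """Label an upper-layer entry from its overlay xattrs (hypothetical labels)."""
    opaque = xattrs.get("opaque")
    if opaque == b"y":
        return "opaque"
    if opaque == b"z":
        # Assumption: "z" is the value seen with the WIP finalize-dir branch.
        return "finalized"
    if "metacopy" in xattrs:
        return "metacopy"
    if "redirect" in xattrs:
        return "redirect"
    return "plain"


def report(upper_root):
    """Print one 'label<TAB>path' line per directory and file under upper_root."""
    for dirpath, _dirnames, filenames in os.walk(upper_root):
        entries = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in entries:
            print(classify(overlay_xattrs(path)), path, sep="\t")
```

Running report() on the upperdir path once after the "chown -h -R" and
again after the "ls -lR" should show which directories changed from
"plain" to "finalized" between the two steps.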