On Fri, 19 Jul 2024 at 10:19, Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > >> Basically, I need to be able to continue serving the same files and > > > >> paths even while the copy-up metadata process for any part of the tree > > > >> is in progress. And it sounds like your idea of considering a copy-up > > > >> of a merged dir as "complete" (and essentially opaque) would be the > > > >> way to do that without files or dirs ever moving or losing access even > > > >> momentarily. > > > > > > > > > > > > Yes, that's the idea. > > > > > > > > I'll see when I get around to that demo. > > > > > > I found some time to write the POC patch, but not enough time > > > to make it work :) - it is failing some fstests. > > > > > > Since I don't know when I will have time to debug the issues, > > > here is the WIP if you want to debug it and point out the bugs: > > > > > > https://github.com/amir73il/linux/commits/ovl-finalize-dir/ > > > > This is very cool - many thanks! > > > > Pushed two fixes to this branch - now it passes fstest - please re-test. I can confirm the root dentry issue is fixed and I now see a lot less ENOENTs hitting the lower (NFS) directory. For my test case, we go from ~80,000 -> 27,000 ENOENTs, a great improvement. However I was still seeing some lookups hitting the lower layer in directories that had opaque=z. After a bit of testing and much head scratching, it seems like you need to do something like "ls /olv/blah/thing/version/lib" to force a readdir of the lower NFS filesystem (after remount) to reliably stop the lookups hitting the lower filesystem. For example, after doing the steps to copy up metadata and set opaque=z, we then remount and test the effect: chown -R -h bob /olv/blah/thing/version/lib ls -lR /ovl/blah/thing/version/lib umount /ovl mount /ovl for x in {1..10}; do stat /olv/blah/thing/version/lib/${RANDOM};done This will always send the lookup to the lower level. But if we do: ls -l /olv/blah/thing/version/lib for x in {1..10}; do stat /olv/blah/thing/version/lib/${RANDOM}.so;done this triggers a readdir of the lower NFS dir and then the random file lookups only hit the upper layer as expected. So then I'm not entirely clear why in my real world test *most* of the negative lookups are served by the upper level but some are not. I can only think that loading software *mostly* triggers READDIRs on things like lib dirs, but not always? It may also be complicated by the fact that we often have symlinks at various points in the path too (both dir & file). > ovl will lookup the lower data of a metacopy file at lookup time. > Perhaps we could do lazy lookup of data files, like we do with > data-only layers, but this would be more complicated and in any > case, stat(2) needs the lower data file to present the correct > value of st_blocks. Right... for some reason I thought it only needed st_size which it could get from the sparse file. > IOW, this patch mainly helps to prevent *negative* lookups > in the lower layers. > > > expect no lookups to the lower for negative lookups? Unless we can't > > serve negative lookups from the readdir of the upper dir? > > Correct expectation. > The bug was fixed. It should hopefully work now, > although I did not test. > > > > > I have probably misunderstood that the "finalized" directories will > > only serve the contents of the readdir result and not send metadata > > lookups to the lower level (ala dir=opaque). Or my v6.9.3 kernel has > > some other issue unrelated to this patch.... > > You understood almost correctly, but you need to understand that > the finalized directory is not completely opaque, it is partly opaque. > You can visualize this as painting the space "between entries" opaque, > so that negative lookup results will stop as well as readdir, but ovl > will still look *underneath* entries. > > BTW, ovl will also lookup in lower dirs if entries inside an upper > opaque dir have an absolute path redirect xattr. > This type of upper opaque dir is also called "impure" and has an > "impure" xattr. > > So you may say that the difference between an opaque=y > dir and opaque=z dir is that in the latter, all entries are treated > as if they have an absolute path redirect (to their own path), > but that is a very hand wavy way of putting it. Yea, that makes sense to me. > Anyway, if you are happy with the patch and want to see it upstreamed, > I have few requests: > > 1. Please provide your Tested-by. > 2. Please provide a detailed cover letter that explains the problem (*) > and how the patch fixes it, which includes performance gain numbers. > You may post the cover letter with my patch or ask me to do it I'll draft up something for review and I'll try to keep it concise. > 3. Please write an fstest to test the finalized dir feature (**) > You may want to look at test overlay/031 for inspiration > if also makes changes to layers offline and examines the outcome Bear with me... I'll have a look and see if I can figure it out. > (*) Please do NOT include the story of changing lower entries > under a mounted overlayfs - they are not relevant to this patch > and they are very controversial. Understood. However, I do suspect that some are using this "undefined behaviour" because it worked for their use case. This patch might ruin their day? > (**) Basically do chown -R and ls, unmount overlay, add lower entry > re-mount overlay and verify that it is not observed. > Can also sanity check that opaque=z was set. > Note that adding lower entries offline is a behavior that was once allowed > and some users actually expect it to work, so I made the feature > depend on !ovl_allow_offline_changes(ofs). > Therefore, the test should explicitly require and enable metacopy for > overlayfs mount to enable the finalized dir feature. Many thanks, Daire