Re: overlayfs: NFS lowerdir changes & opaque negative lookups

On Tue, 23 Jul 2024 at 22:31, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Mon, Jul 22, 2024 at 6:28 PM Daire Byrne <daire@xxxxxxxx> wrote:
> >
> > On Fri, 19 Jul 2024 at 10:19, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > >
> > > > > >> Basically, I need to be able to continue serving the same files and
> > > > > >> paths even while the copy-up metadata process for any part of the tree
> > > > > >> is in progress. And it sounds like your idea of considering a copy-up
> > > > > >> of a merged dir as "complete" (and essentially opaque) would be the
> > > > > >> way to do that without files or dirs ever moving or losing access even
> > > > > >> momentarily.
> > > > > >
> > > > > >
> > > > > > Yes, that's the idea.
> > > > > >
> > > > > > I'll see when I get around to that demo.
> > > > >
> > > > > I found some time to write the POC patch, but not enough time
> > > > > to make it work :) - it is failing some fstests.
> > > > >
> > > > > Since I don't know when I will have time to debug the issues,
> > > > > here is the WIP if you want to debug it and point out the bugs:
> > > > >
> > > > > https://github.com/amir73il/linux/commits/ovl-finalize-dir/
> > > >
> > > > This is very cool - many thanks!
> > > >
> > >
> > > Pushed two fixes to this branch - now it passes fstest - please re-test.
> >
> > I can confirm the root dentry issue is fixed, and I now see far fewer
> > ENOENTs hitting the lower (NFS) directory. For my test case, we go
> > from ~80,000 -> 27,000 ENOENTs, a great improvement.
> >
> > However, I was still seeing some lookups hitting the lower layer in
> > directories that had opaque=z. After a bit of testing and much head
> > scratching, it seems like you need to do something like "ls
> > /ovl/blah/thing/version/lib" to force a readdir of the lower NFS
> > filesystem (after remount) to reliably stop the lookups hitting the
> > lower filesystem.
> >
> > For example, after doing the steps to copy up metadata and set
> > opaque=z, we then remount and test the effect:
> >
> > chown -R -h bob /ovl/blah/thing/version/lib
> > ls -lR /ovl/blah/thing/version/lib
> > umount /ovl
> > mount /ovl
> > for x in {1..10}; do stat /ovl/blah/thing/version/lib/${RANDOM}; done
> >
> > This will always send the lookup to the lower level. But if we do:
> >
> > ls -l /ovl/blah/thing/version/lib
> > for x in {1..10}; do stat /ovl/blah/thing/version/lib/${RANDOM}.so; done
> >
> > this triggers a readdir of the lower NFS dir and then the random file
> > lookups only hit the upper layer as expected.
> >
> > So I'm not entirely clear why, in my real-world test, *most* of the
> > negative lookups are served by the upper layer but some are not. I can
> > only think that loading software *mostly* triggers READDIRs on things
> > like lib dirs, but not always?
> >
> > It may also be complicated by the fact that we often have symlinks at
> > various points in the path too (both dir & file).
> >
>
> No, again, it's just a bug.
> My fix to initialize __lastmerged for the root dentry broke the
> initialization of __lastmerged for lookups of non-root dentries;
> readdir would then fix up __lastmerged later.
>
> Pushed a fix. Again, only sanity tested, so the finalize feature may
> still have bugs.

Yep, that totally fixed the issue - after a remount, the directories
are still "finalized" and negative lookups are not hitting the lower
NFS filesystem.
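
For reference, a quick way to sanity check that the marker survives the
remount is to dump the overlay xattrs on the copied-up dir in the
upperdir (paths below are placeholders, and I'm assuming the finalized
state shows up as the "z" value of trusted.overlay.opaque):

# needs root to read the trusted.* namespace
getfattr -d -m 'trusted.overlay' /upper/blah/thing/version/lib
# expect: trusted.overlay.opaque="z" for a finalized dir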

So I have been doing some basic benchmarks to figure out how best to
make the case for this patch being included upstream. I have also been
looking at fscache, as it's a good companion to the idea of "caching"
or "localising" both metadata and data for a remote NFS filesystem full
of software releases.

I took one of our more complicated applications with ~200 search
paths (LD_LIBRARY_PATH, PYTHONPATH, PLUGINS, etc.) and recorded a
packet capture (tcpdump) along with the launch times:

vanilla NFS - total packets=475531, NOENTs=69456, OPENs=61146
fscache NFS - total packets=200120, NOENTs=69460, OPENs=61045
fscache+overlay - total packets=67157, NOENTs=45, OPENs=6412

The time to start the software went from 58s (vanilla NFS) to 32s
(fscache+overlay). In both the fscache and fscache+overlay cases, the
measurement was taken after a remount where the fscache reads had
already been populated (so 0 READ calls over the network) and the lib
directories had been copied up and finalised.
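
For reference, counts like the above can be reproduced with something
along these lines (a rough sketch only - the interface, server name and
capture file are placeholders, and the NOENT/OPEN breakdown assumes an
NFSv4 mount plus tshark, whose display-filter field names I'm quoting
from memory):

# capture the client<->server traffic for the duration of the launch
tcpdump -i eth0 -w launch.pcap host nfs-server &
run-the-application   # placeholder for the actual launch
kill %1

# total packets
tcpdump -r launch.pcap 2>/dev/null | wc -l
# LOOKUP replies that returned NFS4ERR_NOENT (status 2, OP_LOOKUP = 15)
tshark -r launch.pcap -Y 'nfs.opcode == 15 && nfs.nfsstat4 == 2' | wc -l
# OPEN calls (OP_OPEN = 18, rpc.msgtyp == 0 selects calls, not replies)
tshark -r launch.pcap -Y 'nfs.opcode == 18 && rpc.msgtyp == 0' | wc -l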

I have also had pretty good success using a systemtap script to
monitor for the NOENT lookups in lib dirs and pass those to a
userspace daemon, which then initiates the "chown -R -h bob
/ovl/blah/thing/lib/" commands. I might try bpftrace too, to compare
and contrast.
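
If I do, it will probably start from the classic save-on-enter /
report-on-exit pattern, something like the sketch below. This watches
ENOENT results from newfstatat at the syscall level rather than hooking
overlayfs internals, and the path filtering plus the hand-off to the
daemon are left out:

# report stat() calls that fail with ENOENT so a daemon can decide
# which lib dirs to copy up ("-2" is -ENOENT)
bpftrace -e '
tracepoint:syscalls:sys_enter_newfstatat { @f[tid] = args->filename; }
tracepoint:syscalls:sys_exit_newfstatat /@f[tid]/ {
  if (args->ret == -2) { printf("%s ENOENT %s\n", comm, str(@f[tid])); }
  delete(@f[tid]);
}'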

I was a little surprised to see that the copy-up was also helping to
reduce the OPEN and LOOKUP calls hitting the remote filesystem, but
I'll take that bonus improvement too. If we just take the fscache ->
fscache+overlay case, there is an almost 3x reduction in packet counts
(roundtrips) with the metacopy overlay (200120 -> 67157). If we look at
the improvement over the vanilla NFS mount we currently use, it's a 7x
reduction in packets between client and NFS server (475531 -> 67157).

> > > ovl will lookup the lower data of a metacopy file at lookup time.
> > > Perhaps we could do lazy lookup of data files, like we do with
> > > data-only layers, but this would be more complicated and in any
> > > case, stat(2) needs the lower data file to present the correct
> > > value of st_blocks.
> >
> > Right... for some reason I thought it only needed st_size which it
> > could get from the sparse file.
> >
> > > IOW, this patch mainly helps to prevent *negative* lookups
> > > in the lower layers.
> > >
> > > > expect no lookups to the lower for negative lookups? Unless we can't
> > > > serve negative lookups from the readdir of the upper dir?
> > >
> > > Correct expectation.
> > > The bug was fixed. It should hopefully work now,
> > > although I did not test.
> > >
> > > >
> > > > I have probably misunderstood that the "finalized" directories will
> > > > only serve the contents of the readdir result and not send metadata
> > > > lookups to the lower level (ala dir=opaque). Or my v6.9.3 kernel has
> > > > some other issue unrelated to this patch....
> > >
> > > You understood almost correctly, but you need to understand that
> > > the finalized directory is not completely opaque, it is partly opaque.
> > > You can visualize this as painting the space "between entries" opaque,
> > > so that negative lookup results will stop as well as readdir, but ovl
> > > will still look *underneath* entries.
> > >
> > > BTW, ovl will also lookup in lower dirs if entries inside an upper
> > > opaque dir have an absolute path redirect xattr.
> > > This type of upper opaque dir is also called "impure" and has an
> > > "impure" xattr.
> > >
> > > So you may say that the difference between an opaque=y
> > > dir and opaque=z dir is that in the latter, all entries are treated
> > > as if they have an absolute path redirect (to their own path),
> > > but that is a very hand wavy way of putting it.
> >
> > Yea, that makes sense to me.
> >
> > > Anyway, if you are happy with the patch and want to see it upstreamed,
> > > I have a few requests:
> > >
> > > 1. Please provide your Tested-by.
> > > 2. Please provide a detailed cover letter that explains the problem (*)
> > >     and how the patch fixes it, which includes performance gain numbers.
> > >     You may post the cover letter with my patch or ask me to do it
> >
> > I'll draft up something for review and I'll try to keep it concise.

I'll try to get to this next week.

> >
> > > 3. Please write an fstest to test the finalized dir feature (**)
> > >     You may want to look at test overlay/031 for inspiration
> > >     as it also makes changes to layers offline and examines the outcome
> >
> > Bear with me... I'll have a look and see if I can figure it out.
> >
>
> I am not in a hurry, quite the contrary.
> I have just a bit of time to help :)
>
> > > (*) Please do NOT include the story of changing lower entries
> > >     under a mounted overlayfs - they are not relevant to this patch
> > >     and they are very controversial.
> >
> > Understood. However, I do suspect that some are using this "undefined
> > behaviour" because it worked for their use case. This patch might ruin
> > their day?
> >
>
> That's a good question - I don't know.
> We can either try to see if someone shouts or add it as an opt-in feature.
> I kind of dislike the idea of adding an opt-in option for this behavior,
> which is explicitly documented as undefined and could lead to several
> other weird issues, so I hope we can get away without opt-in.
>
> Even if we agreed to extend the defined behavior to cover changes to
> lower-layer objects which overlayfs has not yet observed, that definition
> would not include the case where the merged directory was already iterated.

Maybe it could be enabled by default, but with an option to disable it?

But you really know best when it comes to changes like this.

Many thanks for your help and support with this. I'm truly grateful!

Now I need to finesse my userspace daemon for doing the copy-up and
try this out with production workloads (we have 5000+ batch hosts
running up to 6 applications at the same time).

Daire




