Re: OPEN_XOR_DELEGATION performance problems

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 19 Nov 2024 11:37:53 -0500

On Tue, 2024-11-19 at 16:23 +0000, Chuck Lever III wrote:
> 
> > On Nov 19, 2024, at 10:09 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Tue, 2024-11-19 at 06:45 -0500, Jeff Layton wrote:
> > > We attempted to implement the "delstid" draft for v6.13, but have had
> > > to drop the patches for it. After merge, we got a couple of reports
> > > of a performance issue due to the OPEN_XOR_DELEGATION patch:
> > >
> > > https://lore.kernel.org/linux-nfs/202409161645.d44bced5-oliver.sang@xxxxxxxxx/
> > > 
> > > Once we enable OPEN_XOR_DELEGATION support, the fsmark "App Overhead"
> > > statistic spikes significantly. The kernel patch for this is very
> > > simple, and doesn't seem likely to cause a performance issue on its
> > > own. My theory is that this test is one that causes the client to
> > > return the delegation, and since it doesn't have an open stateid, it
> > > has to reestablish one during the test run, and that causes the app
> > > overhead stat to spike.
> > > 
> > > Trond, Tom, Mike -- I know that the HS Anvil has support for
> > > OPEN_XOR_DELEGATION. If you run the fsmark test against it with that
> > > support both enabled and disabled (either on the client or server
> > > side), do you see a similar spike in "App Overhead"?
> > > 
> > > If so, then I suspect we need to consider limiting the use of that
> > > flag in some cases. I have no idea what heuristic we'd use to decide
> > > this, though.
> > 
> > As already stated when we discussed this at Bakeathon: the server is
> > still in charge of heuristics w.r.t. whether or not there may be
> > contention for the file. The OPEN_XOR_DELEGATION flag changes nothing
> > in that respect.
> 
> fsmark is a single-client test. There should be no contention
> for any files during this test.
>
> > Yes, I'm sure you can find tests which cause recalls of delegations,
> > and those will be marginally slower when the client has to re-establish
> > an open stateid.
> 
> The fsmark result regressed 92%.
> 

To be clear, it's the fsmark "App Overhead" statistic that regressed
92%, and it has a curious definition: "App overhead is time in
microseconds spent in the test not doing file writing related system
calls."

It encompasses a lot of different test setup work, so it's been
difficult to pin down which part got slower. See Oliver's email here:

https://lore.kernel.org/linux-nfs/ZwTm4e5JxOOJc7JC@xsang-OptiPlex-9020/
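For illustration only, here is a toy model of the theory quoted earlier
in the thread: with OPEN_XOR_DELEGATION the OPEN reply carries a
delegation stateid but no open stateid, so a client that still has the
file open when the delegation goes back has to send an extra OPEN first.
All names here (OpenReply, server_open, rpcs_to_return_delegation,
total_rpcs) are hypothetical, and the model assumes a delegation is
granted and later returned for every file. It only counts round trips;
it says nothing about wall-clock time or about which part of "App
Overhead" actually got slower.

```python
from dataclasses import dataclass


@dataclass
class OpenReply:
    """Hypothetical model of the stateids an OPEN reply hands back."""
    open_stateid: bool   # reply carried an open stateid
    deleg_stateid: bool  # reply carried a delegation stateid


def server_open(xor_delegation: bool, grant_delegation: bool) -> OpenReply:
    # With OPEN_XOR_DELEGATION and a granted delegation, the reply has
    # only the delegation stateid. Otherwise it always has an open
    # stateid, plus a delegation stateid if one was granted.
    if grant_delegation and xor_delegation:
        return OpenReply(open_stateid=False, deleg_stateid=True)
    return OpenReply(open_stateid=True, deleg_stateid=grant_delegation)


def rpcs_to_return_delegation(reply: OpenReply, file_still_open: bool) -> int:
    """Count RPCs the client sends when it gives a delegation back."""
    if not reply.deleg_stateid:
        return 0  # no delegation, nothing to return
    rpcs = 1  # DELEGRETURN
    if file_still_open and not reply.open_stateid:
        # Assumption: the client must first re-establish an open
        # stateid with another OPEN before returning the delegation.
        rpcs += 1
    return rpcs


def total_rpcs(n_files: int, xor_delegation: bool) -> int:
    """Initial OPENs plus delegation returns, assuming every file's
    delegation is returned while the file is still open."""
    total = 0
    for _ in range(n_files):
        reply = server_open(xor_delegation, grant_delegation=True)
        total += 1  # the initial OPEN
        total += rpcs_to_return_delegation(reply, file_still_open=True)
    return total


print(total_rpcs(1000, xor_delegation=False), total_rpcs(1000, xor_delegation=True))
# prints: 2000 3000
```

Under these assumptions each returned delegation costs one extra OPEN
when the flag is in use, which is the kind of overhead the theory
predicts; whether that accounts for the measured regression is exactly
what's still unclear.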

-- 
Jeff Layton <jlayton@xxxxxxxxxx>