On Wed, Jun 01, 2022 at 04:37:47PM +0000, Chuck Lever III wrote:
> 
> > On Jun 1, 2022, at 12:10 PM, Frank van der Linden <fllinden@xxxxxxxxxx> wrote:
> > 
> > On Wed, Jun 01, 2022 at 12:34:34AM +0000, Chuck Lever III wrote:
> >>> On May 27, 2022, at 5:34 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> >>> 
> >>> 
> >>> 
> >>>> On May 27, 2022, at 4:37 PM, Frank van der Linden <fllinden@xxxxxxxxxx> wrote:
> >>>> 
> >>>> On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
> >>>>> 
> >>>>> 
> >>>>> Hi Frank-
> >>>>> 
> >>>>> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
> >>>>> Do you have a reproducer I can try?
> >>>> 
> >>>> Hi Chuck,
> >>>> 
> >>>> The easiest way to reproduce the issue is to run generic/531 over an
> >>>> NFSv4 mount, using a system with a larger number of CPUs on the client
> >>>> side (or just scaling the test up manually - it has a calculation based
> >>>> on the number of CPUs).
> >>>> 
> >>>> The test will take a long time to finish. I initially described the
> >>>> details here:
> >>>> 
> >>>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> >>>> 
> >>>> Since then, it was also reported here:
> >>>> 
> >>>> https://lore.kernel.org/all/20210531125948.2D37.409509F4@xxxxxxxxxxxx/T/#m8c3e4173696e17a9d5903d2a619550f352314d20
> >>> 
> >>> Thanks for the summary. So, there isn't a bugzilla tracking this
> >>> issue? If not, please create one here:
> >>> 
> >>> https://bugzilla.linux-nfs.org/
> >>> 
> >>> Then we don't have to keep asking for a repeat summary ;-)
> >> 
> >> I can easily reproduce this scenario in my lab. I've opened:
> >> 
> >> https://bugzilla.linux-nfs.org/show_bug.cgi?id=386
> >> 
> > 
> > Thanks for taking care of that. I'm switching jobs, so I won't have much
> > time to look at it or test for a few weeks.
> 
> No problem. I can reproduce the failure, and I have some ideas
> of how to address the issue, so I've assigned the bug to myself.
> 
> > I think the basic problem is that the filecache is a clear win for
> > v3, since that's stateless and it avoids a lookup for each operation.
> > 
> > For v4, it's not clear to me that it's much of a win, and in this case
> > it definitely gets in the way.
> > 
> > Maybe the best thing is to not bother at all with the caching for v4,
> 
> At this point I don't think we can go that way. The NFSv4 code
> uses a lot of the same infrastructural helpers as NFSv3, and
> all of those now depend on the use of nfsd_file objects.
> 
> Certainly, though, the filecache plays somewhat different roles
> for legacy NFS and NFSv4. I've been toying with the idea of
> maintaining separate filecaches for NFSv3 and NFSv4, since
> the garbage collection and shrinker rules are fundamentally
> different for the two, and NFSv4 wants a file closed completely
> (no lingering open) when it does a CLOSE or DELEGRETURN.
> 
> In the meantime, the obvious culprit is that the LRU walk during
> garbage collection is broken. I've talked with Dave Chinner,
> co-author of list_lru, about a way to straighten this out so
> that the LRU walk is very nicely bounded and at the same time
> deals properly with NFSv4 OPEN and CLOSE. Trond also had an
> idea or two here, and it seems the three of us are on nearly
> the same page.
> 
> Once that is addressed, we can revisit Wang's suggestion of
> serializing garbage collection, as a nice optimization.
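For illustration, here is a minimal userspace sketch of the bounded
walk described above. This is hypothetical, simplified code with
invented names; it is not the actual filecache fix, and the real work
would be done against the kernel's list_lru API in fs/nfsd/filecache.c.
The two properties it demonstrates: a GC pass visits at most nr_to_scan
entries, so a large cache never means an unbounded scan, and entries
still in use are rotated rather than reaped.

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Illustrative userspace sketch of a bounded LRU GC walk.
     * Entries still in use (e.g. pinned by an NFSv4 OPEN) are
     * rotated to the tail; idle entries are closed and freed.
     */

    struct nf_entry {
        struct nf_entry *prev, *next;
        int refcount;   /* > 0: in use, must not be reaped */
        int id;
    };

    static struct nf_entry lru = { &lru, &lru, 0, -1 }; /* circular head */

    static void lru_add_tail(struct nf_entry *e)
    {
        e->prev = lru.prev;
        e->next = &lru;
        lru.prev->next = e;
        lru.prev = e;
    }

    static void lru_del(struct nf_entry *e)
    {
        e->prev->next = e->next;
        e->next->prev = e->prev;
    }

    /* One GC pass: examine at most nr_to_scan entries from the head. */
    static unsigned long lru_gc(unsigned long nr_to_scan)
    {
        unsigned long freed = 0;

        while (nr_to_scan-- && lru.next != &lru) {
            struct nf_entry *e = lru.next;

            lru_del(e);
            if (e->refcount > 0) {
                lru_add_tail(e);   /* still open: rotate, don't reap */
                continue;
            }
            free(e);               /* idle: close the file and free */
            freed++;
        }
        return freed;
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++) {
            struct nf_entry *e = malloc(sizeof(*e));

            e->refcount = i % 2;   /* every other entry is "open" */
            e->id = i;
            lru_add_tail(e);
        }
        printf("reaped %lu idle entries\n", lru_gc(4));
        return 0;
    }

The real implementation also has to coordinate with the shrinker and
with NFSv4 state, but the bounded scan plus rotate-in-use behavior is
the core of what is being discussed above.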
Sounds good, thanks!

A related issue: there is currently no upper limit that I can see
on the number of active OPENs for a client. So essentially, a
client can run a server out of resources by doing a very large
number of OPENs. Should there be an upper limit, above which
requests are either denied, or old state is invalidated?

- Frank
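For illustration, a hypothetical sketch of such a cap, with invented
names and an arbitrary limit; nfsd does not implement this today. The
idea: each client tracks how many OPEN stateids it holds, and an OPEN
over the limit fails with NFS4ERR_RESOURCE instead of consuming more
server memory.

    #include <stdio.h>
    #include <stdatomic.h>

    /*
     * Hypothetical per-client cap on active OPEN state -- not
     * current nfsd behavior; names and the limit are invented.
     */

    #define NFS4ERR_RESOURCE     10018   /* RFC 7530 */
    #define MAX_OPENS_PER_CLIENT 4096    /* arbitrary example limit */

    struct client {
        atomic_long open_count;   /* active OPEN stateids */
    };

    /* Called while processing OPEN; returns 0 or an NFSv4 error. */
    static int client_get_open_slot(struct client *clp)
    {
        long opens = atomic_fetch_add(&clp->open_count, 1);

        if (opens >= MAX_OPENS_PER_CLIENT) {
            atomic_fetch_sub(&clp->open_count, 1);
            return NFS4ERR_RESOURCE;
        }
        return 0;
    }

    /* Called when a stateid goes away (CLOSE, lease expiry, etc.). */
    static void client_put_open_slot(struct client *clp)
    {
        atomic_fetch_sub(&clp->open_count, 1);
    }

    int main(void)
    {
        struct client c = { .open_count = 0 };
        int err = 0;

        /* Simulate a client hammering the server with OPENs. */
        for (long i = 0; i < MAX_OPENS_PER_CLIENT + 1 && !err; i++)
            err = client_get_open_slot(&c);

        printf("OPEN #%ld failed with %d\n",
               (long)atomic_load(&c.open_count) + 1, err);
        client_put_open_slot(&c);  /* e.g. a CLOSE comes in */
        return 0;
    }

The other alternative raised above, invalidating old state, would
replace the error return with a reap of that client's oldest idle
stateids before admitting the new OPEN.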