On Tue, 07 Jan 2025, Chuck Lever wrote:
> On 1/5/25 10:02 PM, NeilBrown wrote:
> > On Mon, 06 Jan 2025, Chuck Lever wrote:
> >> On 1/5/25 6:11 PM, NeilBrown wrote:
> >>> +	unsigned long num_to_scan = min(cnt, 1024UL);
> >>
> >> I see long delays with fewer than 1024 items on the list. I might
> >> drop this number by one or two orders of magnitude. And make it a
> >> symbolic constant.
> >
> > In that case I seriously wonder if this is where the delays are coming
> > from.
> >
> > nfsd_file_dispose_list_delayed() does take and drop a spinlock
> > repeatedly (though it may not always be the same lock) and call
> > svc_wake_up() repeatedly - although the head of the queue might already
> > be woken.  We could optimise that to detect runs with the same nn and
> > only take the lock once, and only wake_up once.
> >
> >>
> >> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
> >> that relate to this new cap? Should that also be a symbolic constant?
> >
> > I don't think they relate.
> > The trade-off with "8" is:
> >   a bigger number might block an nfsd thread for longer,
> >     forcing serialising when the work can usefully be done in parallel.
> >   a smaller number might needlessly wake lots of threads
> >     to share out a tiny amount of work.
> >
> > The 1024 is simply about "don't hold a spinlock for too long".
> 
> By that, I think you mean list_lru_walk() takes &l->lock for the
> duration of the scan? For a long scan, that would effectively block
> adding or removing LRU items for quite some time.
> 
> So here's a typical excerpt from a common test:
> 
> kworker/u80:7-206 [003] 266.985735: nfsd_file_unhash: ...
> 
> kworker/u80:7-206 [003] 266.987723: nfsd_file_gc_removed: 1309
> entries removed, 2972 remaining
> 
> nfsd-1532 [015] 266.988626: nfsd_file_free: ...
> 
> Here, the nfsd_file_unhash record marks the beginning of the LRU
> walk, and the nfsd_file_gc_removed record marks the end. The
> timestamps indicate the walk took two milliseconds.
> The nfsd_file_free record above marks the last disposal activity.
> That takes almost a millisecond, but as far as I can tell, it
> does not hold any locks for long.
> 
> This seems to me like a strong argument for cutting the scan size
> down to no more than 32-64 items. Ideally spin locks are supposed
> to be held only for simple operations (eg, list_add); this seems a
> little outside that window (hence your remark that "a large
> nr_to_walk is always a bad idea" -- I now see what you meant).

This is useful - thanks.

So the problem seems to be that holding the list_lru lock while
scanning the whole list can block all incoming NFSv3 requests for a
noticeable amount of time - 2 msecs above.  That makes perfect sense,
and as you say it suggests that the lack of scheduling points isn't
really the issue.

This confirms for me that the list_lru approach is not a good fit for
this problem.  I have written a patch which replaces it with a pair of
simple lists, as I described in my cover letter.  It is a bit large and
needs careful review.  In particular I haven't given thought to the
tracepoints, which might need to be moved, changed, or discarded.

Thanks,
NeilBrown