On 1/8/25 3:41 PM, NeilBrown wrote:
> On Thu, 09 Jan 2025, Chuck Lever wrote:
>> On 1/7/25 6:01 PM, NeilBrown wrote:
>>> On Tue, 07 Jan 2025, Chuck Lever wrote:
>>>> On 1/5/25 10:02 PM, NeilBrown wrote:
>>>>> On Mon, 06 Jan 2025, Chuck Lever wrote:
>>>>>> On 1/5/25 6:11 PM, NeilBrown wrote:
>>>>>>> +	unsigned long num_to_scan = min(cnt, 1024UL);
>>>>>> I see long delays with fewer than 1024 items on the list. I might
>>>>>> drop this number by one or two orders of magnitude, and make it a
>>>>>> symbolic constant.
>>>>> In that case I seriously wonder if this is where the delays are
>>>>> coming from.
>>>>>
>>>>> nfsd_file_dispose_list_delayed() does take and drop a spinlock
>>>>> repeatedly (though it may not always be the same lock) and call
>>>>> svc_wake_up() repeatedly - although the head of the queue might
>>>>> already be woken. We could optimise that to detect runs with the
>>>>> same nn and only take the lock once, and only wake_up once.
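
Purely as an illustration of the batching Neil describes, a sketch of one way it might look: coalesce consecutive dispose-list entries that belong to the same nfsd_net, so the disposal lock is taken and the server woken once per run rather than once per file. Structure and field names (nf_gc, fcache_disposal, freeme) loosely follow fs/nfsd/filecache.c and may not match an eventual patch.

static void
nfsd_file_dispose_list_delayed(struct list_head *dispose)
{
	while (!list_empty(dispose)) {
		struct nfsd_file *nf = list_first_entry(dispose,
					struct nfsd_file, nf_gc);
		struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
		struct nfsd_fcache_disposal *l = nn->fcache_disposal;

		/* Move the whole run of entries belonging to this
		 * nfsd_net under a single lock acquisition ... */
		spin_lock(&l->lock);
		do {
			list_move_tail(&nf->nf_gc, &l->freeme);
			if (list_empty(dispose))
				break;
			nf = list_first_entry(dispose,
					struct nfsd_file, nf_gc);
		} while (net_generic(nf->nf_net, nfsd_net_id) == nn);
		spin_unlock(&l->lock);

		/* ... and issue one wake-up per run, not per file. */
		if (nn->nfsd_serv)
			svc_wake_up(nn->nfsd_serv);
	}
}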
>>>> There's another naked integer (8) in nfsd_file_net_dispose() -- how
>>>> does that relate to this new cap? Should that also be a symbolic
>>>> constant?
>>> I don't think they relate.
>>>
>>> The trade-off with "8" is:
>>>  - a bigger number might block an nfsd thread for longer,
>>>    forcing serialisation when the work can usefully be done in
>>>    parallel;
>>>  - a smaller number might needlessly wake lots of threads
>>>    to share out a tiny amount of work.
>>>
>>> The 1024 is simply about "don't hold a spinlock for too long".
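
For context on that trade-off, here is the shape of the "8" path, paraphrased and simplified from fs/nfsd/filecache.c (not verbatim): each woken nfsd thread claims up to eight entries under the disposal lock, wakes the next thread if work remains, then does the potentially slow frees outside the lock. The batch size therefore sets how long one RPC-service thread is diverted from serving requests.

void nfsd_file_net_dispose(struct nfsd_net *nn)
{
	struct nfsd_fcache_disposal *l = nn->fcache_disposal;

	if (!list_empty(&l->freeme)) {
		LIST_HEAD(dispose);
		int i;

		/* Claim a small batch under the lock ... */
		spin_lock(&l->lock);
		for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
			list_move(l->freeme.next, &dispose);
		spin_unlock(&l->lock);

		/* ... hand any remainder to another thread ... */
		if (!list_empty(&l->freeme))
			svc_wake_up(nn->nfsd_serv);

		/* ... and do the slow frees outside the lock. */
		nfsd_file_dispose_list(&dispose);
	}
}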
>> By that, I think you mean list_lru_walk() takes &l->lock for the
>> duration of the scan? For a long scan, that would effectively block
>> adding or removing LRU items for quite some time.
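
A sketch of the shape of that walk, heavily simplified (this is not the actual mm/list_lru.c code; memcg handling, error paths, and the exact callback contract are omitted). The point it illustrates: the per-node spinlock is held across the entire batch, so every LRU add or del on that node spins until the walk finishes.

/* Sketch only -- simplified from the general shape of list_lru. */
unsigned long
list_lru_walk_one_sketch(struct list_lru_node *nlru,
			 list_lru_walk_cb isolate, void *cb_arg,
			 unsigned long nr_to_walk)
{
	struct list_head *item, *n;
	unsigned long isolated = 0;

	spin_lock(&nlru->lock);		/* held for the whole batch */
	list_for_each_safe(item, n, &nlru->lru.list) {
		if (!nr_to_walk--)
			break;
		/* The isolate callback runs with nlru->lock held;
		 * only a few of its return codes drop the lock. */
		if (isolate(item, &nlru->lru, cb_arg) == LRU_REMOVED)
			isolated++;
	}
	spin_unlock(&nlru->lock);
	return isolated;
}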
>> So here's a typical excerpt from a common test:
>>
>>   kworker/u80:7-206  [003]  266.985735: nfsd_file_unhash: ...
>>   kworker/u80:7-206  [003]  266.987723: nfsd_file_gc_removed: 1309 entries removed, 2972 remaining
>>           nfsd-1532  [015]  266.988626: nfsd_file_free: ...
>>
>> Here, the nfsd_file_unhash record marks the beginning of the LRU
>> walk, and the nfsd_file_gc_removed record marks the end. The
>> timestamps indicate the walk took two milliseconds.
>>
>> The nfsd_file_free record above marks the last disposal activity.
>> That takes almost a millisecond, but as far as I can tell, it
>> does not hold any locks for long.
>> This seems to me like a strong argument for cutting the scan size
>> down to no more than 32-64 items. Ideally spin locks are supposed
>> to be held only for simple operations (eg, list_add); this seems a
>> little outside that window (hence your remark that "a large
>> nr_to_walk is always a bad idea" -- I now see what you meant).
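
As a rough back-of-the-envelope check, assuming the walk cost scales linearly: 1309 entries removed in about 2 msec works out to roughly 1.5 usec per entry, so a 64-item cap would hold the lock on the order of 100 usec per pass, and a 32-item cap around 50 usec -- much closer to conventional spinlock hold times.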
> This is useful - thanks.
>
> So the problem seems to be that holding the list_lru lock while
> scanning the whole list can block all incoming NFSv3 for a noticeable
> amount of time - 2 msecs above. That makes perfect sense and, as you
> say, it suggests that the lack of scheduling points isn't really the
> issue.
>
> This confirms for me that the list_lru approach is not a good fit for
> this problem. I have written a patch which replaces it with a pair of
> simple lists, as I described in my cover letter.
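
The cover letter isn't quoted here, so purely as an illustration of the general idea, one plausible shape of a two-list scheme (all names are hypothetical, not taken from the actual patch): entries land on a "recent" list when touched, each GC pass demotes them, and anything that has gone two passes untouched is freed.

/* All names hypothetical -- an illustration only. */
struct nfsd_file_gc_lists {
	spinlock_t		lock;
	struct list_head	recent;	/* used since the last pass */
	struct list_head	old;	/* unused for one full pass */
};

static void nfsd_file_gc_pass(struct nfsd_file_gc_lists *gc,
			      struct list_head *dispose)
{
	LIST_HEAD(to_free);

	spin_lock(&gc->lock);
	/* Anything still on "old" went two passes untouched: free it. */
	list_splice_init(&gc->old, &to_free);
	/* Age the survivors: recent -> old. */
	list_splice_init(&gc->recent, &gc->old);
	spin_unlock(&gc->lock);

	/* The actual frees happen outside the lock. */
	list_splice(&to_free, dispose);
}

Whatever the real patch looks like, the property this shape buys is that the lock is held only for two O(1) list splices per pass, regardless of how many files are freed.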
>> Before proceeding with replacement of the LRU, is there interest in
>> addressing this issue in LTS kernels as well? If so, then IMO the
>> better approach would be to take a variant of your narrower fix for
>> v6.14, and then visit the deeper LRU changes for v6.15ff.
> That is probably reasonable. You could take the first patch, drop the
> 1024 to 64 (or less if testing suggests that is still too high), and
> maybe drop the cond_resched().
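
Concretely, the LTS variant might amount to no more than this (a hypothetical hunk: the constant name is invented here, and the final value depends on the testing Neil mentions):

+#define NFSD_FILE_GC_BATCH	64UL	/* placeholder value */
 ...
-	unsigned long num_to_scan = min(cnt, 1024UL);
+	unsigned long num_to_scan = min(cnt, NFSD_FILE_GC_BATCH);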
I will make it so. Enjoy the rest of your leave!
--
Chuck Lever