On 1/5/25 10:02 PM, NeilBrown wrote:
> On Mon, 06 Jan 2025, Chuck Lever wrote:
>> On 1/5/25 6:11 PM, NeilBrown wrote:
>>> +	unsigned long num_to_scan = min(cnt, 1024UL);
>>
>> I see long delays with fewer than 1024 items on the list. I might
>> drop this number by one or two orders of magnitude. And make it a
>> symbolic constant.
> In that case I seriously wonder if this is where the delays are coming
> from.
>
> nfsd_file_dispose_list_delayed() does take and drop a spinlock
> repeatedly (though it may not always be the same lock) and call
> svc_wake_up() repeatedly - although the head of the queue might already
> be woken. We could optimise that to detect runs with the same nn and
> only take the lock once, and only wake_up once.
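The run-detection described above might look something like the following
userspace sketch. The struct, the lock helpers, and svc_wake_up() here are
stand-ins for the real nfsd types, with counters added so the batching
behaviour is observable:

```c
#include <stddef.h>

/* Stand-ins for the real nfsd types; counters observe the batching. */
struct nfsd_net { int id; };

static int lock_acquisitions;
static int wakeups;

static void lock_net(struct nfsd_net *nn)    { (void)nn; lock_acquisitions++; }
static void unlock_net(struct nfsd_net *nn)  { (void)nn; }
static void svc_wake_up(struct nfsd_net *nn) { (void)nn; wakeups++; }

/*
 * Dispose a batch of entries, but take the per-net lock and issue the
 * wake-up once per run of entries sharing the same nn, rather than
 * once per entry.
 */
static void dispose_list_delayed(struct nfsd_net **nns, size_t n)
{
	size_t i = 0;

	while (i < n) {
		struct nfsd_net *nn = nns[i];

		lock_net(nn);
		do {
			/* move nns[i] onto nn's dispose list here */
			i++;
		} while (i < n && nns[i] == nn);
		unlock_net(nn);
		svc_wake_up(nn);	/* one wake-up per run */
	}
}
```

With six entries forming three runs (aaa, bb, a), the lock is taken and the
wake-up issued three times instead of six.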
>> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
>> that relate to this new cap? Should that also be a symbolic constant?
> I don't think they relate.
>
> The trade-off with "8" is:
>  - a bigger number might block an nfsd thread for longer,
>    forcing serialising when the work can usefully be done in parallel.
>  - a smaller number might needlessly wake lots of threads
>    to share out a tiny amount of work.
>
> The 1024 is simply about "don't hold a spinlock for too long".
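The shape of that trade-off: each woken nfsd thread disposes at most a small
fixed number of files, then wakes a further thread if work remains. A
userspace sketch (the constant name and helpers are assumptions, not the
kernel's identifiers):

```c
#include <stddef.h>

#define DISPOSE_BATCH 8	/* hypothetical name for the naked "8" */

static int extra_wakeups;

static void wake_another_thread(void) { extra_wakeups++; }

/*
 * Dispose up to DISPOSE_BATCH entries from a pending list; if entries
 * remain afterwards, wake one more thread to continue in parallel.
 * Returns the number of entries this thread disposed of.
 */
static size_t net_dispose(size_t *pending)
{
	size_t done = 0;

	while (*pending > 0 && done < DISPOSE_BATCH) {
		(*pending)--;	/* free one nfsd_file here */
		done++;
	}
	if (*pending > 0)
		wake_another_thread();
	return done;
}
```

With 20 pending entries, three threads each take 8, 8, and 4 entries, and
only the first two wake a successor -- a bigger batch means fewer wake-ups
but a longer-running thread, a smaller batch the reverse.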
By that, I think you mean list_lru_walk() takes &l->lock for the
duration of the scan? For a long scan, that would effectively block
adding or removing LRU items for quite some time.
So here's a typical excerpt from a common test:
   kworker/u80:7-206 [003]  266.985735: nfsd_file_unhash: ...
   kworker/u80:7-206 [003]  266.987723: nfsd_file_gc_removed: 1309 entries removed, 2972 remaining
           nfsd-1532 [015]  266.988626: nfsd_file_free: ...
Here, the nfsd_file_unhash record marks the beginning of the LRU
walk, and the nfsd_file_gc_removed record marks the end. The
timestamps indicate the walk took two milliseconds.
The nfsd_file_free record above marks the last disposal activity.
That takes almost a millisecond, but as far as I can tell, it
does not hold any locks for long.
This seems to me like a strong argument for cutting the scan size
down to no more than 32-64 items. Ideally, spinlocks are supposed
to be held only for simple operations (e.g., list_add); this seems a
little outside that window (hence your remark that "a large
nr_to_walk is always a bad idea" -- I now see what you meant).
IMHO the patch description should single out that purpose: We want to
significantly reduce the maximum amount of time that list_lru_walk()
blocks foreground LRU activity such as list_lru_add_obj().
The cond_resched() in this case might be gravy.
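Concretely, I'm picturing the gc walk proceeding in small fixed-size batches,
dropping the list lock (and allowing rescheduling) between batches, so that
concurrent list_lru_add_obj()/list_lru_del_obj() callers are never blocked
for more than one batch. A userspace sketch; the constant name and the
function are assumptions, with the locking shown only as comments:

```c
#include <stddef.h>

#define NFSD_FILE_GC_BATCH 32	/* hypothetical symbolic constant */

static size_t lock_hold_max;	/* longest run walked under the lock */

/*
 * Walk 'count' LRU items in batches of at most NFSD_FILE_GC_BATCH,
 * releasing the list lock between batches.  Returns the number of
 * items walked.
 */
static size_t gc_walk(size_t count)
{
	size_t walked = 0;

	while (walked < count) {
		size_t batch = count - walked;

		if (batch > NFSD_FILE_GC_BATCH)
			batch = NFSD_FILE_GC_BATCH;

		/* spin_lock(&lru->lock); */
		if (batch > lock_hold_max)
			lock_hold_max = batch;
		walked += batch;	/* isolate/scan 'batch' items */
		/* spin_unlock(&lru->lock); */
		/* cond_resched() would go here */
	}
	return walked;
}
```

The point is that the lock hold time is bounded by the batch size, not by
the total number of items on the LRU.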
>>> +	ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>> +			    &dispose, num_to_scan);
>>> +	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>> +	nfsd_file_dispose_list_delayed(&dispose);
>> I need to go back and review the function traces to see where the
>> delays add up -- to make sure rescheduling here, rather than at some
>> other point, is appropriate. It probably is, but my memory fails me
>> these days.
> I would like to see those function traces too.
Here's my reproducer:
1. On the client: Set up xfstests to use NFSv3
2. On the server: "sudo trace-cmd record -e nfsd &"
3. On the client: Run xfstests with "sudo ./check -nfs generic/750"
--
Chuck Lever