On 1/5/25 10:02 PM, NeilBrown wrote:
> On Mon, 06 Jan 2025, Chuck Lever wrote:
>> On 1/5/25 6:11 PM, NeilBrown wrote:
>>> +	unsigned long num_to_scan = min(cnt, 1024UL);
>>
>> I see long delays with fewer than 1024 items on the list. I might
>> drop this number by one or two orders of magnitude. And make it a
>> symbolic constant.
> In that case I seriously wonder if this is where the delays are coming
> from.
>
> nfsd_file_dispose_list_delayed() does take and drop a spinlock
> repeatedly (though it may not always be the same lock) and call
> svc_wake_up() repeatedly - although the head of the queue might already
> be woken. We could optimise that to detect runs with the same nn and
> only take the lock once, and only wake_up once.
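The run-detection described above might look something like the following
userspace sketch. The struct, the lock helpers, and svc_wake_up() here are
stand-ins for the real nfsd types, with counters added so the batching
behaviour is observable:

```c
#include <stddef.h>

/* Stand-ins for the real nfsd types; counters observe the batching. */
struct nfsd_net { int id; };

static int lock_acquisitions;
static int wakeups;

static void lock_net(struct nfsd_net *nn)    { (void)nn; lock_acquisitions++; }
static void unlock_net(struct nfsd_net *nn)  { (void)nn; }
static void svc_wake_up(struct nfsd_net *nn) { (void)nn; wakeups++; }

/*
 * Dispose a batch of entries, but take the per-net lock and issue the
 * wake-up once per run of entries sharing the same nn, rather than
 * once per entry.
 */
static void dispose_list_delayed(struct nfsd_net **nns, size_t n)
{
	size_t i = 0;

	while (i < n) {
		struct nfsd_net *nn = nns[i];

		lock_net(nn);
		do {
			/* move nns[i] onto nn's dispose list here */
			i++;
		} while (i < n && nns[i] == nn);
		unlock_net(nn);
		svc_wake_up(nn);	/* one wake-up per run */
	}
}
```

With six entries forming three runs (aaa, bb, a), the lock is taken and the
wake-up issued three times instead of six.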
>> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
>> that relate to this new cap? Should that also be a symbolic constant?
> I don't think they relate.
>
> The trade-off with "8" is:
>  - a bigger number might block an nfsd thread for longer,
>    forcing serialising when the work can usefully be done in parallel.
>  - a smaller number might needlessly wake lots of threads
>    to share out a tiny amount of work.
>
> The 1024 is simply about "don't hold a spinlock for too long".
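The shape of that trade-off: each woken nfsd thread disposes at most a small
fixed number of files, then wakes a further thread if work remains. A
userspace sketch (the constant name and helpers are assumptions, not the
kernel's identifiers):

```c
#include <stddef.h>

#define DISPOSE_BATCH 8	/* hypothetical name for the naked "8" */

static int extra_wakeups;

static void wake_another_thread(void) { extra_wakeups++; }

/*
 * Dispose up to DISPOSE_BATCH entries from a pending list; if entries
 * remain afterwards, wake one more thread to continue in parallel.
 * Returns the number of entries this thread disposed of.
 */
static size_t net_dispose(size_t *pending)
{
	size_t done = 0;

	while (*pending > 0 && done < DISPOSE_BATCH) {
		(*pending)--;	/* free one nfsd_file here */
		done++;
	}
	if (*pending > 0)
		wake_another_thread();
	return done;
}
```

With 20 pending entries, three threads each take 8, 8, and 4 entries, and
only the first two wake a successor -- a bigger batch means fewer wake-ups
but a longer-running thread, a smaller batch the reverse.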
By that, I think you mean list_lru_walk() takes &l->lock for the
duration of the scan? For a long scan, that would effectively block
adding or removing LRU items for quite some time.
So here's a typical excerpt from a common test:
   kworker/u80:7-206 [003]  266.985735: nfsd_file_unhash: ...
   kworker/u80:7-206 [003]  266.987723: nfsd_file_gc_removed: 1309 entries removed, 2972 remaining
           nfsd-1532 [015]  266.988626: nfsd_file_free: ...
Here, the nfsd_file_unhash record marks the beginning of the LRU
walk, and the nfsd_file_gc_removed record marks the end. The
timestamps indicate the walk took two milliseconds.
The nfsd_file_free record above marks the last disposal activity.
That takes almost a millisecond, but as far as I can tell, it
does not hold any locks for long.
This seems to me like a strong argument for cutting the scan size
down to no more than 32-64 items. Ideally, spinlocks are supposed
to be held only for simple operations (e.g., list_add); this seems a
little outside that window (hence your remark that "a large
nr_to_walk is always a bad idea" -- I now see what you meant).
IMHO the patch description should single out that purpose: We want to
significantly reduce the maximum amount of time that list_lru_walk()
blocks foreground LRU activity such as list_lru_add_obj().
The cond_resched() in this case might be gravy.
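Concretely, I'm picturing the gc walk proceeding in small fixed-size batches,
dropping the list lock (and allowing rescheduling) between batches, so that
concurrent list_lru_add_obj()/list_lru_del_obj() callers are never blocked
for more than one batch. A userspace sketch; the constant name and the
function are assumptions, with the locking shown only as comments:

```c
#include <stddef.h>

#define NFSD_FILE_GC_BATCH 32	/* hypothetical symbolic constant */

static size_t lock_hold_max;	/* longest run walked under the lock */

/*
 * Walk 'count' LRU items in batches of at most NFSD_FILE_GC_BATCH,
 * releasing the list lock between batches.  Returns the number of
 * items walked.
 */
static size_t gc_walk(size_t count)
{
	size_t walked = 0;

	while (walked < count) {
		size_t batch = count - walked;

		if (batch > NFSD_FILE_GC_BATCH)
			batch = NFSD_FILE_GC_BATCH;

		/* spin_lock(&lru->lock); */
		if (batch > lock_hold_max)
			lock_hold_max = batch;
		walked += batch;	/* isolate/scan 'batch' items */
		/* spin_unlock(&lru->lock); */
		/* cond_resched() would go here */
	}
	return walked;
}
```

The point is that the lock hold time is bounded by the batch size, not by
the total number of items on the LRU.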
>>> +	ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>> +			    &dispose, num_to_scan);
>>> +	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>> +	nfsd_file_dispose_list_delayed(&dispose);
>> I need to go back and review the function traces to see where the
>> delays add up -- to make sure rescheduling here, rather than at some
>> other point, is appropriate. It probably is, but my memory fails me
>> these days.
> I would like to see those function traces too.
Here's my reproducer:
1. On the client: Set up xfstests to use NFSv3
2. On the server: "sudo trace-cmd record -e nfsd &"
3. On the client: Run xfstests with "sudo ./check -nfs generic/750"
--
Chuck Lever