On Tue, 07 Jan 2025, Chuck Lever wrote:
> On 1/5/25 10:02 PM, NeilBrown wrote:
> > On Mon, 06 Jan 2025, Chuck Lever wrote:
> >> On 1/5/25 6:11 PM, NeilBrown wrote:
> >>> +	unsigned long num_to_scan = min(cnt, 1024UL);
> >>
> >> I see long delays with fewer than 1024 items on the list. I might
> >> drop this number by one or two orders of magnitude. And make it a
> >> symbolic constant.
> >
> > In that case I seriously wonder if this is where the delays are coming
> > from.
> >
> > nfsd_file_dispose_list_delayed() does take and drop a spinlock
> > repeatedly (though it may not always be the same lock) and call
> > svc_wake_up() repeatedly - although the head of the queue might already
> > be woken.  We could optimise that to detect runs with the same nn and
> > only take the lock once, and only wake_up once.
> >
> >>
> >> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
> >> that relate to this new cap? Should that also be a symbolic constant?
> >
> > I don't think they relate.
> > The trade-off with "8" is:
> >   a bigger number might block an nfsd thread for longer,
> >     forcing serialising when the work can usefully be done in parallel.
> >   a smaller number might needlessly wake lots of threads
> >     to share out a tiny amount of work.
> >
> > The 1024 is simply about "don't hold a spinlock for too long".
> 
> By that, I think you mean list_lru_walk() takes &l->lock for the
> duration of the scan? For a long scan, that would effectively block
> adding or removing LRU items for quite some time.
> 
> So here's a typical excerpt from a common test:
> 
> kworker/u80:7-206 [003] 266.985735: nfsd_file_unhash: ...
> 
> kworker/u80:7-206 [003] 266.987723: nfsd_file_gc_removed: 1309
> entries removed, 2972 remaining
> 
> nfsd-1532 [015] 266.988626: nfsd_file_free: ...
> 
> Here, the nfsd_file_unhash record marks the beginning of the LRU
> walk, and the nfsd_file_gc_removed record marks the end. The
> timestamps indicate the walk took two milliseconds.
> The nfsd_file_free record above marks the last disposal activity.
> That takes almost a millisecond, but as far as I can tell, it
> does not hold any locks for long.
> 
> This seems to me like a strong argument for cutting the scan size
> down to no more than 32-64 items. Ideally spin locks are supposed
> to be held only for simple operations (eg, list_add); this seems a
> little outside that window (hence your remark that "a large
> nr_to_walk is always a bad idea" -- I now see what you meant).

This is useful - thanks.

So the problem seems to be that holding the list_lru lock while
scanning the whole list can block all incoming NFSv3 requests for a
noticeable amount of time - 2 msecs above.  That makes perfect sense,
and as you say it suggests that the lack of scheduling points isn't
really the issue.

This confirms for me that the list_lru approach is not a good fit for
this problem.  I have written a patch which replaces it with a pair of
simple lists, as I described in my cover letter.  It is a bit large and
needs careful review.  In particular I haven't given thought to the
tracepoints, which might need to be moved, changed, or discarded.

Thanks,
NeilBrown