On Thu, 09 Jan 2025, Chuck Lever wrote:
> On 1/7/25 6:01 PM, NeilBrown wrote:
> > On Tue, 07 Jan 2025, Chuck Lever wrote:
> >> On 1/5/25 10:02 PM, NeilBrown wrote:
> >>> On Mon, 06 Jan 2025, Chuck Lever wrote:
> >>>> On 1/5/25 6:11 PM, NeilBrown wrote:
> >>
> >>>>> +        unsigned long num_to_scan = min(cnt, 1024UL);
> >>>>
> >>>> I see long delays with fewer than 1024 items on the list. I might
> >>>> drop this number by one or two orders of magnitude. And make it a
> >>>> symbolic constant.
> >>>
> >>> In that case I seriously wonder if this is where the delays are
> >>> coming from.
> >>>
> >>> nfsd_file_dispose_list_delayed() does take and drop a spinlock
> >>> repeatedly (though it may not always be the same lock) and call
> >>> svc_wake_up() repeatedly - although the head of the queue might
> >>> already be woken.  We could optimise that to detect runs with the
> >>> same nn and only take the lock once, and only wake_up once.
> >>>
> >>>>
> >>>> There's another naked integer (8) in nfsd_file_net_dispose() --
> >>>> how does that relate to this new cap? Should that also be a
> >>>> symbolic constant?
> >>>
> >>> I don't think they relate.
> >>> The trade-off with "8" is:
> >>>   a bigger number might block an nfsd thread for longer,
> >>>     forcing serialising when the work can usefully be done in parallel.
> >>>   a smaller number might needlessly wake lots of threads
> >>>     to share out a tiny amount of work.
> >>>
> >>> The 1024 is simply about "don't hold a spinlock for too long".
> >>
> >> By that, I think you mean list_lru_walk() takes &l->lock for the
> >> duration of the scan? For a long scan, that would effectively block
> >> adding or removing LRU items for quite some time.
> >>
> >> So here's a typical excerpt from a common test:
> >>
> >> kworker/u80:7-206 [003]  266.985735: nfsd_file_unhash: ...
> >>
> >> kworker/u80:7-206 [003]  266.987723: nfsd_file_gc_removed: 1309
> >> entries removed, 2972 remaining
> >>
> >> nfsd-1532         [015]  266.988626: nfsd_file_free: ...
> >>
> >> Here, the nfsd_file_unhash record marks the beginning of the LRU
> >> walk, and the nfsd_file_gc_removed record marks the end. The
> >> timestamps indicate the walk took two milliseconds.
> >>
> >> The nfsd_file_free record above marks the last disposal activity.
> >> That takes almost a millisecond, but as far as I can tell, it
> >> does not hold any locks for long.
> >>
> >> This seems to me like a strong argument for cutting the scan size
> >> down to no more than 32-64 items. Ideally spin locks are supposed
> >> to be held only for simple operations (eg, list_add); this seems a
> >> little outside that window (hence your remark that "a large
> >> nr_to_walk is always a bad idea" -- I now see what you meant).
> >
> > This is useful - thanks.
> > So the problem seems to be that holding the list_lru lock while
> > scanning the whole list can block all incoming NFSv3 requests for a
> > noticeable amount of time - 2 msecs above.  That makes perfect sense
> > and, as you say, it suggests that the lack of scheduling points isn't
> > really the issue.
> >
> > This confirms for me that the list_lru approach is not a good fit
> > for this problem.  I have written a patch which replaces it with a
> > pair of simple lists as I described in my cover letter.
>
> Before proceeding with replacement of the LRU, is there interest in
> addressing this issue in LTS kernels as well? If so, then IMO the
> better approach would be to take a variant of your narrower fix for
> v6.14, and then visit the deeper LRU changes for v6.15ff.

That is probably reasonable.
You could take the first patch, drop the 1024 to 64 (or less if testing
suggests that is still too high), and maybe drop the cond_resched().
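Roughly, that narrower variant could look like the sketch below.  To be
clear about what is and is not real here: the shape of nfsd_file_gc()
is paraphrased from memory rather than copied from the tree,
NFSD_FILE_GC_BATCH is only a placeholder name for the symbolic constant
you asked for, and none of this has been compiled or tested - it is
just to show the shape of the change.

#define NFSD_FILE_GC_BATCH      64UL    /* placeholder name, not from the tree */

static void
nfsd_file_gc(void)
{
        unsigned long cnt = list_lru_count(&nfsd_file_lru);
        unsigned long removed = 0;
        LIST_HEAD(dispose);

        while (cnt > 0) {
                /*
                 * Cap each walk so the lru spinlock is only held while
                 * scanning one small batch, never the whole list.
                 */
                unsigned long num_to_scan = min(cnt, NFSD_FILE_GC_BATCH);

                removed += list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
                                         &dispose, num_to_scan);
                /*
                 * Hand this batch to the nfsd threads before scanning
                 * the next; no cond_resched() between batches.
                 */
                nfsd_file_dispose_list_delayed(&dispose);
                cnt -= num_to_scan;
        }
        trace_nfsd_file_gc_removed(removed, list_lru_count(&nfsd_file_lru));
}

The lock is dropped and re-taken between batches, so even a list the
size of the one in your trace never sees one long hold, and each batch
is handed off for disposal before the next scan starts.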

Thanks,
NeilBrown