On Sun, 05 Jan 2025, Chuck Lever wrote: > On 1/3/25 5:53 PM, Tejun Heo wrote: > > On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote: > >> On Fri, 03 Jan 2025, cel@xxxxxxxxxx wrote: > > ... > >> I think that instead of passing "list_lru_count()" we should pass some > >> constant like 1024. > >> > >> cnt = list_lru_count() > >> while (cnt > 0) { > >> num = min(cnt, 1024); > >> list_lru_walk(...., num); > >> cond_sched() > >> cnt -= num; > >> } > >> > >> Then run it from system_wq. > >> > >> list_lru_shrink is most often called as list_lru_shrink_walk() from a > >> shrinker, and the pattern there is essentially that above. A count is > >> taken, possibly scaled down, then the shrinker is called in batches. > > > > BTW, there's nothing wrong with taking some msecs or even tens of msecs > > running on system_unbound_wq, so the current state may be fine too. > > My thinking was that this work is low priority, so there should be > plenty of opportunity to set it aside for a few moments and handle > higher priority work. Maybe not worrisome on systems with a high core > count, but on small SMP (eg VM guests), I've found that tasks like this > can be rude neighbors. None of the different work queues are expected to "set aside" the work for more than a normal scheduling time slice. system_long_wq was created "so that flush_scheduled_work() doesn't take so long" but flush_scheduled_work() is never called now (and would generate a warning if it was) so system_long_wq should really be removed. system_unbound_wq uses threads that are not bound to a CPU so the scheduler can move them around. That is most suitable for something that might run for a long time. system_wq is bound to the CPU that schedules the work, but it can run multiple work items - potentially long running ones - without trouble by helping the scheduler share the CPU among them. This requires that they can sleep of course. I haven't seen the actually symptoms that resulted in the various changes to this code, but that last point is I think the key one. The work item, like all kernel code, needs to have scheduler points if it might run for a longish time. list_lru_walk() doesn't contain any scheduling points so giving a large "nr_to_walk" is always a bad idea. > > We could do this by adding a cond_resched() call site in the loop, > or take Neil's suggestion of breaking up the free list across multiple > work items that handle one or just a few file releases each. I guess I should send a proper patch. NeilBrown > > -- > Chuck Lever >