On Mon, 06 Jan 2025, Chuck Lever wrote:
> On 1/5/25 5:28 PM, NeilBrown wrote:
> > On Sun, 05 Jan 2025, Chuck Lever wrote:
> >> On 1/3/25 5:53 PM, Tejun Heo wrote:
> >>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> >>>> On Fri, 03 Jan 2025, cel@xxxxxxxxxx wrote:
> >>> ...
> >>>> I think that instead of passing "list_lru_count()" we should pass some
> >>>> constant like 1024.
> >>>>
> >>>>    cnt = list_lru_count()
> >>>>    while (cnt > 0) {
> >>>>        num = min(cnt, 1024);
> >>>>        list_lru_walk(...., num);
> >>>>        cond_resched();
> >>>>        cnt -= num;
> >>>>    }
> >>>>
> >>>> Then run it from system_wq.
> >>>>
> >>>> list_lru_walk() is most often called as list_lru_shrink_walk() from a
> >>>> shrinker, and the pattern there is essentially that above.  A count is
> >>>> taken, possibly scaled down, then the shrinker is called in batches.
> >>>
> >>> BTW, there's nothing wrong with taking some msecs or even tens of msecs
> >>> running on system_unbound_wq, so the current state may be fine too.
> >>
> >> My thinking was that this work is low priority, so there should be
> >> plenty of opportunity to set it aside for a few moments and handle
> >> higher priority work.  Maybe not worrisome on systems with a high core
> >> count, but on small SMP (eg VM guests), I've found that tasks like this
> >> can be rude neighbors.
> >
> > None of the different work queues are expected to "set aside" the work
> > for more than a normal scheduling time slice.
> > system_long_wq was created "so that flush_scheduled_work() doesn't take
> > so long" but flush_scheduled_work() is never called now (and would
> > generate a warning if it was) so system_long_wq should really be
> > removed.
> >
> > system_unbound_wq uses threads that are not bound to a CPU so the
> > scheduler can move them around.  That is most suitable for something
> > that might run for a long time.
>
> That's not my understanding: unbound selects a CPU to run the work
> item on, and it can be (and usually is) not the same CPU as the one
> that invoked queue_work().  Then it isn't rescheduled.  The work items
> are expected to complete quickly; work item termination is the typical
> reschedule point.  My understanding, as always, could be stale.

If the work item will never schedule, then there isn't much point in
using a work queue.  The code can be run inline, or with a timer.

> But that's neither here nor there: I've dropped the patch to replace
> system_unbound_wq with system_long_wq.
>
> > system_wq is bound to the CPU that schedules the work, but it can run
> > multiple work items - potentially long running ones - without trouble by
> > helping the scheduler share the CPU among them.  This requires that they
> > can sleep of course.
> >
> > I haven't seen the actual symptoms that resulted in the various
> > changes to this code, but that last point is I think the key one.  The
> > work item, like all kernel code, needs to have scheduling points if it
> > might run for a longish time.  list_lru_walk() doesn't contain any
> > scheduling points so giving a large "nr_to_walk" is always a bad idea.
>
> That might be overstated... I don't see other consumers that are so
> concerned about rescheduling.  But then it's not clear if they are
> dealing with lengthy LRU lists.

There are very few callers of list_lru_walk().  shrink_dcache_sb() is
the closest analogue and it uses batches of 1024.  xfs_buftarg_drain()
uses batches of LONG_MAX !!!  Mostly the lru is walked in a shrinker
callback, and the shrinker manages the batch size.
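To be concrete about the shape I have in mind, a batched walk could
look something like the sketch below.  This is untested, and the names
(nfsd_file_gc_worker, nfsd_file_lru, nfsd_file_gc_cb,
nfsd_file_dispose_list) are only stand-ins for whatever the filecache
code actually uses; the point is just the count / clamp / walk /
cond_resched() loop:

static void nfsd_file_gc_worker(struct work_struct *work)
{
        LIST_HEAD(dispose);
        unsigned long cnt = list_lru_count(&nfsd_file_lru);

        while (cnt > 0) {
                /* never walk more than one batch before rescheduling */
                unsigned long num = min(cnt, 1024UL);

                /* isolate up to @num entries onto @dispose */
                list_lru_walk(&nfsd_file_lru, nfsd_file_gc_cb,
                              &dispose, num);

                /* free this batch, then give the scheduler a chance */
                nfsd_file_dispose_list(&dispose);
                cond_resched();

                cnt -= num;
        }
}

Whether the batch is 1024 or something smaller doesn't matter much;
what matters is that cond_resched() runs between batches.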
Thanks,
NeilBrown


> >> We could do this by adding a cond_resched() call site in the loop,
> >> or take Neil's suggestion of breaking up the free list across multiple
> >> work items that handle one or just a few file releases each.
> >
> > I guess I should send a proper patch.
>
> More comments in reply to that patch.  Generally speaking, I'm
> comfortable chopping up the work as you propose, we just have to
> address some details.
>
>
> --
> Chuck Lever
>
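And in case it helps with the chopping up: the "one or just a few file
releases each" variant could be a self-requeueing work item, roughly
like this (again untested, and the names are only placeholders):

static void nfsd_file_gc_one_batch(struct work_struct *work)
{
        LIST_HEAD(dispose);

        /* release at most a handful of files per work item */
        list_lru_walk(&nfsd_file_lru, nfsd_file_gc_cb, &dispose, 16);
        nfsd_file_dispose_list(&dispose);

        /*
         * Anything left?  Queue another small item rather than looping,
         * so other work on the same workqueue gets a turn in between.
         */
        if (list_lru_count(&nfsd_file_lru))
                queue_work(system_wq, work);
}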