Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again

"NeilBrown" <neilb@xxxxxxx> · Mon, 06 Jan 2025 09:28:52 +1100

On Sun, 05 Jan 2025, Chuck Lever wrote:
> On 1/3/25 5:53 PM, Tejun Heo wrote:
> > On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> >> On Fri, 03 Jan 2025, cel@xxxxxxxxxx wrote:
> > ...
> >> I think that instead of passing "list_lru_count()" we should pass some
> >> constant like 1024.
> >>
> >> cnt = list_lru_count()
> >> while (cnt > 0) {
> >>     num = min(cnt, 1024);
> >>     list_lru_walk(...., num);
> >>     cond_sched()
> >>     cnt -= num;
> >> }
> >>
> >> Then run it from system_wq.
> >>
> >> list_lru_shrink is most often called as list_lru_shrink_walk() from a
> >> shrinker, and the pattern there is essentially that above.  A count is
> >> taken, possibly scaled down, then the shrinker is called in batches.
> > 
> > BTW, there's nothing wrong with taking some msecs or even tens of msecs
> > running on system_unbound_wq, so the current state may be fine too.
> 
> My thinking was that this work is low priority, so there should be
> plenty of opportunity to set it aside for a few moments and handle
> higher priority work. Maybe not worrisome on systems with a high core
> count, but on small SMP (eg VM guests), I've found that tasks like this
> can be rude neighbors.

None of the different work queues are expected to "set aside" the work
for more than a normal scheduling time slice.
system_long_wq was created "so that flush_scheduled_work() doesn't take
so long" but flush_scheduled_work() is never called now (and would
generate a warning if it was) so system_long_wq should really be
removed. 

system_unbound_wq uses threads that are not bound to a CPU so the
scheduler can move them around.  That is most suitable for something
that might run for a long time.

system_wq is bound to the CPU that schedules the work, but it can run
multiple work items - potentially long running ones - without trouble by
helping the scheduler share the CPU among them.  This requires that they
can sleep of course.

I haven't seen the actually symptoms that resulted in the various
changes to this code, but that last point is I think the key one.  The
work item, like all kernel code, needs to have scheduler points if it
might run for a longish time.  list_lru_walk() doesn't contain any
scheduling points so giving a large "nr_to_walk" is always a bad idea.

> 
> We could do this by adding a cond_resched() call site in the loop,
> or take Neil's suggestion of breaking up the free list across multiple
> work items that handle one or just a few file releases each.

I guess I should send a proper patch.

NeilBrown

> 
> -- 
> Chuck Lever
>