On 1/5/25 5:28 PM, NeilBrown wrote:
On Sun, 05 Jan 2025, Chuck Lever wrote:
On 1/3/25 5:53 PM, Tejun Heo wrote:
On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
On Fri, 03 Jan 2025, cel@xxxxxxxxxx wrote:
...
I think that instead of passing "list_lru_count()" we should pass some
constant like 1024.
cnt = list_lru_count()
while (cnt > 0) {
num = min(cnt, 1024);
list_lru_walk(...., num);
cond_sched()
cnt -= num;
}
Then run it from system_wq.
list_lru_shrink is most often called as list_lru_shrink_walk() from a
shrinker, and the pattern there is essentially that above. A count is
taken, possibly scaled down, then the shrinker is called in batches.
BTW, there's nothing wrong with taking some msecs or even tens of msecs
running on system_unbound_wq, so the current state may be fine too.
My thinking was that this work is low priority, so there should be
plenty of opportunity to set it aside for a few moments and handle
higher priority work. Maybe not worrisome on systems with a high core
count, but on small SMP (eg VM guests), I've found that tasks like this
can be rude neighbors.
None of the different work queues are expected to "set aside" the work
for more than a normal scheduling time slice.
system_long_wq was created "so that flush_scheduled_work() doesn't take
so long" but flush_scheduled_work() is never called now (and would
generate a warning if it was) so system_long_wq should really be
removed.
system_unbound_wq uses threads that are not bound to a CPU so the
scheduler can move them around. That is most suitable for something
that might run for a long time.
That's not my understanding: unbound selects a CPU to run the work
item on, and it can be (and usually is) not the same CPU as the one
that invoked queue_work(). Then it isn't rescheduled. The work items
are expected to complete quickly; work item termination is the typical
reschedule point. My understanding, as always, could be stale.
But that's neither here nor there: I've dropped the patch to replace
system_unbound_wq with system_long_wq.
system_wq is bound to the CPU that schedules the work, but it can run
multiple work items - potentially long running ones - without trouble by
helping the scheduler share the CPU among them. This requires that they
can sleep of course.
I haven't seen the actually symptoms that resulted in the various
changes to this code, but that last point is I think the key one. The
work item, like all kernel code, needs to have scheduler points if it
might run for a longish time. list_lru_walk() doesn't contain any
scheduling points so giving a large "nr_to_walk" is always a bad idea.
That might be a overstated... I don't see other consumers that are so
concerned about rescheduling. But then it's not clear if they are
dealing with lengthy LRU lists.
We could do this by adding a cond_resched() call site in the loop,
or take Neil's suggestion of breaking up the free list across multiple
work items that handle one or just a few file releases each.
I guess I should send a proper patch.
More comments in reply to that patch. Generally speaking, I'm
comfortable chopping up the work as you propose, we just have to
address some details.
--
Chuck Lever