Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again


 



On Fri, 03 Jan 2025, cel@xxxxxxxxxx wrote:
> From: Chuck Lever <chuck.lever@xxxxxxxxxx>
> 
> Youzhong Yang <youzhong@xxxxxxxxx> noticed the workqueue subsystem
> complaining about how long the filecache laundrette was running.
> This resulted in switching from using the system_wq for the
> laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use
> system_unbound_wq for nfsd_file_gc_worker()").
> 
> However, I've seen the laundrette running for multiple milliseconds
> on some workloads, delaying other work. For the purpose of
> scheduling fairness, perhaps a better choice would be to process
> filecache disposal queues on the system_long_wq instead.
> 
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> ---
>  fs/nfsd/filecache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..91a535c2dede 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -112,7 +112,7 @@ static void
>  nfsd_file_schedule_laundrette(void)
>  {
>  	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
> -		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
> +		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
>  				   NFSD_LAUNDRETTE_DELAY);

This doesn't seem right to me.
"unbound_wq" is not bound to any CPU.  A work item queued there just gets
allocated a worker thread, which shouldn't interfere with any other work
(except for normal scheduling contention) until the queue limit is
reached, which is unlikely to happen often.

"long_wq" IS bound to a CPU and is concurrency managed so it could
interfere with other things.  The idea is that work items scheduled
there are not in a hurry and could take longer.  But I dont' think they
should take very long...

I think the problem is elsewhere.
list_lru_walk() can hold a spin lock for as long as it takes to walk
nr_to_walk objects.  This will tie up the CPU no matter which work queue
it is running on.

I think that instead of passing "list_lru_count()" we should pass some
constant like 1024.

	cnt = list_lru_count();
	while (cnt > 0) {
		num = min(cnt, 1024);
		list_lru_walk(...., num);
		cond_resched();
		cnt -= num;
	}

Then run it from system_wq.

list_lru_walk() is most often called as list_lru_shrink_walk() from a
shrinker, and the pattern there is essentially the one above: a count is
taken, possibly scaled down, then the walk is performed in batches.

NeilBrown



