On 11/8/24 4:40 PM, Trond Myklebust wrote:
> On Sat, 2024-11-09 at 00:03 +0000, Trond Myklebust wrote:
>> On Fri, 2024-11-08 at 15:20 -0800, Dai Ngo wrote:
>>> Hi Trond,
>>>
>>> Currently cl_tasks is used to maintain the list of all rpc_tasks
>>> for each rpc_clnt.
>>>
>>> Under heavy write load, we've seen this list grow to millions
>>> of entries. Even though the list is extremely long, the system
>>> still runs fine until the user requests the information for all
>>> active RPC tasks by doing:
>>>
>>> # cat /sys/kernel/debug/sunrpc/rpc_clnt/N/tasks
>>>
>>> When this happens, tasks_start() is called and it acquires the
>>> rpc_clnt.cl_lock to walk the cl_tasks list, returning one entry
>>> at a time to the caller. The cl_lock is held until all tasks on
>>> this list have been processed.
>>>
>>> While the cl_lock is held, completed RPC tasks have to spin-wait
>>> in rpc_task_release_client for the cl_lock. If there are millions
>>> of entries in the cl_tasks list, it takes a long time before
>>> tasks_stop() is called and the cl_lock is released.
>>>
>>> Under heavy load, the rpc_task_release_client threads use up all
>>> the available CPUs in the system, preventing other jobs from
>>> running, and this causes the system to temporarily lock up.
>>>
>>> I'm looking for suggestions on how to address this issue. I think
>>> one option is to convert the cl_tasks list to an xarray to
>>> eliminate the contention on the cl_lock, and I would like to get
>>> the opinion of the community.
>>
>> No. We are definitely not going to add a gravity-challenged
>> solution like xarray to solve a corner-case problem of list
>> iteration.
>>
>> Firstly, this is really only a problem for NFSv3 and NFSv4.0,
>> because they don't actually throttle at the NFS layer.
>
> Actually. Let me correct that...
>
> NFSv4.1 does throttle at the NFS layer, but does so in the RPC
> prepare callback, so perhaps it is affected here too.

Yes, 4.1 is also affected even with throttling by session slots,
because the RPC task is put on the cl_tasks list as soon as it is
created.
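
For reference, the contention pattern looks roughly like this
(simplified from net/sunrpc/debugfs.c and net/sunrpc/clnt.c, not the
exact upstream code):

/*
 * debugfs read side: tasks_start() takes cl_lock, and the seq_file
 * machinery keeps it held across the whole iteration until
 * tasks_stop() runs.
 */
static void *tasks_start(struct seq_file *f, loff_t *ppos)
	__acquires(&clnt->cl_lock)
{
	struct rpc_clnt *clnt = f->private;
	loff_t pos = *ppos;
	struct rpc_task *task;

	spin_lock(&clnt->cl_lock);	/* held until tasks_stop() */
	list_for_each_entry(task, &clnt->cl_tasks, tk_task)
		if (pos-- == 0)
			return task;
	return NULL;
}

/*
 * completion side: every finished task takes the same lock just to
 * unlink itself, so it spins behind the debugfs walker above.
 */
void rpc_task_release_client(struct rpc_task *task)
{
	struct rpc_clnt *clnt = task->tk_client;

	if (clnt != NULL) {
		spin_lock(&clnt->cl_lock);	/* contended */
		list_del(&task->tk_task);
		spin_unlock(&clnt->cl_lock);
		/* ... drop the client reference ... */
	}
	/* ... */
}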
> However we could reduce that problem by moving the addition of the
> rpc_task to the cl_tasks list to the call_start() function.

This should work for 4.1.
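
Roughly something like this, I think (untested sketch; the list_add
would move out of rpc_task_set_client() into call_start()):

/*
 * Untested sketch: defer the cl_tasks insertion from
 * rpc_task_set_client() (task creation time) to call_start(), so a
 * v4.1 task that is still asleep in the prepare callback waiting
 * for a session slot never appears on the list.
 */
static void call_start(struct rpc_task *task)
{
	struct rpc_clnt *clnt = task->tk_client;

	/* was done in rpc_task_set_client() at creation time */
	spin_lock(&clnt->cl_lock);
	list_add_tail(&task->tk_task, &clnt->cl_tasks);
	spin_unlock(&clnt->cl_lock);

	/* ... existing call_start() body: statistics, then ... */
	task->tk_action = call_reserve;
}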

> Doing so leads to less visibility into the full workings of the
> system, however the active tasks will still be fully documented by
> the list, and if we need to, we could supplement that information
> with a total number of queued tasks.

Yes, it would be good to know the total number of tasks that exist
in the system.
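
e.g. something as simple as a per-client atomic counter (cl_nr_tasks
is a made-up name here), bumped at task creation, dropped in
rpc_task_release_client(), and printed from the same debugfs
directory:

/* hypothetical new field in struct rpc_clnt */
atomic_long_t		cl_nr_tasks;

/* at task creation, e.g. in rpc_task_set_client() */
atomic_long_inc(&clnt->cl_nr_tasks);

/* at task teardown, in rpc_task_release_client() */
atomic_long_dec(&clnt->cl_nr_tasks);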

>> Secondly, having millions of entries associated with a single
>> struct rpc_clnt means living in latency hell, where waking up a
>> sleeping task can mean sitting on the rpciod queue for several
>> hundred milliseconds before execution starts, due to the sheer
>> volume of tasks in the queue.
>
> This is still not a major problem for NFSv4.1, since we do have
> throttling happening immediately once the RPC call starts, and the
> task is never awakened until it can be accommodated with a session
> slot.
>
>> So IMHO a better question would be: "What is a sensible throttling
>> scheme for NFSv3 and NFSv4.0?"
>
> Still a problem.

Perhaps we can put the task on the cl_tasks list in call_reserve,
after the rpc_rqst is allocated?
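
i.e. roughly the following (untested; since xprt_reserve() can block,
the earliest point where we know the rpc_rqst exists is
call_reserveresult(), so this sketch puts the list_add there):

/*
 * Untested sketch: add the task to cl_tasks only once a slot
 * (rpc_rqst) has actually been allocated.
 */
static void call_reserveresult(struct rpc_task *task)
{
	int status = task->tk_status;

	task->tk_status = 0;
	if (status >= 0 && task->tk_rqstp) {
		struct rpc_clnt *clnt = task->tk_client;

		/* task now owns an rpc_rqst: make it visible
		 * on cl_tasks only from this point on
		 */
		spin_lock(&clnt->cl_lock);
		list_add_tail(&task->tk_task, &clnt->cl_tasks);
		spin_unlock(&clnt->cl_lock);

		task->tk_action = call_refresh;
		return;
	}
	/* ... existing error/retry handling unchanged ... */
}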
Thank you Trond for your help!
-Dai