On Sat, 2024-11-09 at 00:03 +0000, Trond Myklebust wrote:
> On Fri, 2024-11-08 at 15:20 -0800, Dai Ngo wrote:
> > Hi Trond,
> >
> > Currently cl_tasks is used to maintain the list of all rpc_task's
> > for each rpc_clnt.
> >
> > Under heavy write load, we've seen this list grows to millions
> > of entries. Even though the list is extremely long, the system
> > still runs fine until the user wants to get the information of
> > all active RPC tasks by doing:
> >
> > # cat /sys/kernel/debug/sunrpc/rpc_clnt/N/tasks
> >
> > When this happens, tasks_start() is called and it acquires the
> > rpc_clnt.cl_lock to walk the cl_tasks list, returning one entry
> > at a time to the caller. The cl_lock is held until all tasks on
> > this list have been processed.
> >
> > While the cl_lock is held, completed RPC tasks have to spin wait
> > in rpc_task_release_client for the cl_lock. If there are millions
> > of entries in the cl_tasks list it will take a long time before
> > tasks_stop is called and the cl_lock is released.
> >
> > Under heavy load condition the rpc_task_release_client threads
> > will use up all the available CPUs in the system, preventing other
> > jobs to run and this causes the system to temporarily lock up.
> >
> > I'm looking for suggestions on how to address this issue. I think
> > one option is to convert the cl_tasks list to use xarray to
> > eliminate the contention on the cl_lock and would like to get the
> > opinion from the community.
>
> No. We are definitely not going to add a gravity-challenged solution
> like xarray to solve a corner-case problem of list iteration.
>
> Firstly, this is really only a problem for NFSv3 and NFSv4.0 because
> they don't actually throttle at the NFS layer.

Actually. Let me correct that... NFSv4.1 does throttle at the NFS
layer, but does so in the RPC prepare callback, so perhaps it is
affected here too.
However, we could reduce that problem by moving the addition of the
rpc_task to the cl_tasks list to the call_start() function. Doing so
leads to less visibility into the full workings of the system; however,
the active tasks will still be fully documented by the list, and if we
need to, we could supplement that information with a total number of
queued tasks.

>
> Secondly, having millions of entries associated with a single struct
> rpc_clnt, means living in latency hell, where waking up a sleeping
> task can mean living on the rpciod queue for several 100ms before
> execution starts due to the sheer volume of tasks in the queue.

This is still not a major problem for NFSv4.1 since we do have
throttling happening immediately once the RPC call starts, and the task
is never awakened until it can be accommodated with a session slot.

>
> So IMHO a better question would be: "What is a sensible throttling
> scheme for NFSv3 and NFSv4.0?"

Still a problem.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
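
The idea above of adding a task to cl_tasks only in call_start(), and
keeping a plain count of tasks that are queued but not yet started,
can be sketched as a user-space toy model. This is not kernel code and
not a proposed patch: every name below (toy_task, toy_clnt, toy_*) is
hypothetical, the list is hand-rolled rather than the kernel's
list_head, and locking is left out because the model is
single-threaded. In the kernel, the list manipulation would presumably
still happen under cl_lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rpc_task. */
struct toy_task {
    struct toy_task *prev;
    struct toy_task *next;
    int started;                /* nonzero once linked on the active list */
};

/* Hypothetical stand-in for struct rpc_clnt. */
struct toy_clnt {
    struct toy_task active;     /* list sentinel, models cl_tasks */
    unsigned long nr_queued;    /* tasks created but not yet started */
};

static void toy_clnt_init(struct toy_clnt *c)
{
    c->active.prev = c->active.next = &c->active;
    c->active.started = 0;
    c->nr_queued = 0;
}

/* Task creation: only bump the counter, do not touch the list. */
static void toy_task_create(struct toy_clnt *c, struct toy_task *t)
{
    t->prev = t->next = NULL;
    t->started = 0;
    c->nr_queued++;
}

/* Models the call_start() step: the task becomes visible on the list. */
static void toy_task_start(struct toy_clnt *c, struct toy_task *t)
{
    assert(!t->started);
    assert(c->nr_queued > 0);
    t->next = &c->active;       /* append at the tail */
    t->prev = c->active.prev;
    c->active.prev->next = t;
    c->active.prev = t;
    t->started = 1;
    c->nr_queued--;
}

/* Models task release: unlink if started, otherwise drop the count. */
static void toy_task_release(struct toy_clnt *c, struct toy_task *t)
{
    if (t->started) {
        t->prev->next = t->next;
        t->next->prev = t->prev;
        t->started = 0;
    } else {
        assert(c->nr_queued > 0);
        c->nr_queued--;
    }
}

/* Models the debugfs walk: cost is proportional to started tasks only. */
static unsigned long toy_walk_active(const struct toy_clnt *c)
{
    const struct toy_task *t;
    unsigned long n = 0;

    for (t = c->active.next; t != &c->active; t = t->next)
        n++;
    return n;
}
```

In this model, a walk of the list only visits tasks that have reached
the start step, so its length is bounded by the number of started
tasks rather than by everything that has been created, while nr_queued
still reports how many tasks are waiting behind them.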