> On May 13, 2022, at 9:30 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
>> On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
>>> On 4/29/22 9:37 AM, Chuck Lever III wrote:
>>>> 
>>>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
>>>>> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>>>>> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi NFS folks,
>>>>>>>>>> 
>>>>>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>>>>>> between 5.17 and 5.18-rc1. I've tried to bisect but am not
>>>>>>>>>> having any luck. The problem I'm seeing is that it takes 3
>>>>>>>>>> minutes to copy a file from NFS to the local disk, when it
>>>>>>>>>> should take less than half a second, which it did up through
>>>>>>>>>> 5.17.
>>>>>>>>>> 
>>>>>>>>>> It doesn't seem to be network related, but I can't rule that
>>>>>>>>>> out completely.
>>>>>>>>>> 
>>>>>>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>>>>>>> runs I'll see a problem in 3 out of 100 cycles, sometimes 0
>>>>>>>>>> out of 100. Sometimes I'll see it 100 out of 100.
>>>>>>>>> 
>>>>>>>>> It's not clear from your problem report whether the problem
>>>>>>>>> appears when it's the server running v5.18-rc or the client.
>>>>>>>> 
>>>>>>>> That's because I don't know which it is. I'll do a quick test
>>>>>>>> and find out. I was testing the same kernel across both nodes.
>>>>>>> 
>>>>>>> Looks like it is the client.
>>>>>>> 
>>>>>>> server  client  result
>>>>>>> ------  ------  ------
>>>>>>> 5.17    5.17    Pass
>>>>>>> 5.17    5.18    Fail
>>>>>>> 5.18    5.18    Fail
>>>>>>> 5.18    5.17    Pass
>>>>>>> 
>>>>>>> Is there a patch for the client issue you mentioned that I
>>>>>>> could try?
>>>>>>> 
>>>>>>> -Denny
>>>>>> 
>>>>>> Try this one
>>>>> 
>>>>> Thanks for the patch. Unfortunately it doesn't seem to solve the
>>>>> issue; I still see intermittent hangs. I applied it on top of -rc4:
>>>>> 
>>>>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>>>>> 
>>>>> real    2m6.072s
>>>>> user    0m0.002s
>>>>> sys     0m0.263s
>>>>> Done
>>>>> 
>>>>> While it was hung I checked the mem usage on the machine:
>>>>> 
>>>>> # free -h
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
>>>>> Swap:         4.0Gi          0B       4.0Gi
>>>>> 
>>>>> Doesn't appear to be under memory pressure.
>>>> 
>>>> Hi, since you know now that it is the client, perhaps a bisect
>>>> would be more successful?
>>> 
>>> I've been testing all week. I pulled the nfs-rdma tree that was sent
>>> to Linus for 5.18 and tested. I see the problem on pretty much all
>>> the patches. However, it's the frequency with which it hits that
>>> changes.
>>> 
>>> I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
>>> "NFS: Convert readdir page cache to use a cookie based index"
>>> 
>>> After this I start seeing it around 10 times in 500, and by the last
>>> patch 10 times in less than 100.
>>> 
>>> Is there any kind of tracing/debugging I could turn on to get more
>>> insight into what is taking so long when it does go bad?
>> 
>> Ran a test with -rc6 and this time I see a hung task trace on the
>> console as well as an NFS RPC error.
>> 
>> [32719.991175] nfs: RPC call returned error 512
>> .
>> .
>> .
>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
>> [32933.293543]       Tainted: G S                5.18.0-rc6 #1
>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141 ppid:     2 flags:0x00004000
>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>> [32933.324524] Call Trace:
>> [32933.327347]  <TASK>
>> [32933.329785]  __schedule+0x3dd/0x970
>> [32933.333783]  schedule+0x41/0xa0
>> [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>> [32933.343639]  ? prepare_to_wait+0xd0/0xd0
>> [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>> [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
>> [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>> [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
>> [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
>> [32933.376564]  ? __switch_to_asm+0x42/0x70
>> [32933.381046]  ? finish_task_switch+0xb2/0x2c0
>> [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
>> [32933.391391]  process_one_work+0x1c8/0x390
>> [32933.395975]  worker_thread+0x30/0x360
>> [32933.400162]  ? process_one_work+0x390/0x390
>> [32933.404931]  kthread+0xd9/0x100
>> [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
>> [32933.413984]  ret_from_fork+0x22/0x30
>> [32933.418074]  </TASK>
>> 
>> The call trace shows up again at 245, 368, and 491 seconds. Same
>> task, same trace.
> 
> That's very helpful. The above trace suggests that the RDMA code is
> leaking a call to xprt_unpin_rqst().

IMHO this is unlikely to be related to the performance regression --
none of this code has changed in the past 5 kernel releases. It could
be a different issue, though.

As is often the case in these situations, the INFO trace above happens
long after the issue that caused the missing unpin. So... unless Dennis
has a reproducer that can trigger the issue frequently, I don't think
there's much that can be extracted from it.

Also, "nfs: RPC call returned error 512" suggests someone hit ^C at
some point. It's always possible that the xprt_rdma_free() path is
missing an unpin. But again, that's not likely to be related to
performance.

> From what I can see, rpcrdma_reply_handler() will always call
> xprt_pin_rqst(), and that call is required to be matched by a call to
> xprt_unpin_rqst() before the handler exits. However those calls to
> xprt_unpin_rqst() appear to be scattered around the code, and it is
> not at all obvious that the requirement is being fulfilled.

I don't think an unpin is required before rpcrdma_reply_handler()
exits. It /is/ required after xprt_complete_rqst(), and sometimes that
call has to be deferred until the NIC is fully done with the RPC's
buffers. The NIC is often not finished with the buffers in the Receive
completion handler.

> Chuck, why does that code need to be so complex?

The overall goal is that the upper layer cannot be awoken until the NIC
has finished with the RPC's header and payload buffers. These buffers
are treated differently depending on how the send path chose to marshal
the RPC.
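
To make that contract concrete, here is a minimal sketch of "pin on
Receive, unpin only after complete". The example_* names are invented
for illustration -- this is not the actual rpcrdma_reply_handler()
code, and error handling is omitted:

/*
 * Minimal sketch only. xprt_pin_rqst(), xprt_unpin_rqst(), and
 * xprt_complete_rqst() are the real sunrpc APIs; everything named
 * example_* is made up for this example.
 */
#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/sunrpc/xprt.h>

/* Wake the upper-layer waiter, then (and only then) drop the pin
 * that the reply handler took. */
static void example_complete(struct rpc_rqst *rqst, int reply_len)
{
        struct rpc_xprt *xprt = rqst->rq_xprt;

        spin_lock(&xprt->queue_lock);
        xprt_complete_rqst(rqst->rq_task, reply_len);
        xprt_unpin_rqst(rqst);
        spin_unlock(&xprt->queue_lock);
}

/* Receive completion: the reply has arrived, but the NIC may still
 * own the RPC's buffers (Send not yet complete, or registered memory
 * not yet invalidated). */
static void example_reply_handler(struct rpc_rqst *rqst, int reply_len,
                                  bool nic_done_with_buffers)
{
        struct rpc_xprt *xprt = rqst->rq_xprt;

        /* Pin so the rqst cannot be released while the NIC can still
         * touch it; the pin is taken under the queue_lock. */
        spin_lock(&xprt->queue_lock);
        xprt_pin_rqst(rqst);
        spin_unlock(&xprt->queue_lock);

        if (!nic_done_with_buffers) {
                /* Defer: a later Send or LOCAL_INV completion calls
                 * example_complete(), which does the unpin. */
                return;
        }

        example_complete(rqst, reply_len);
}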
And, all of these are hot paths, so it's important to ensure that we
are waiting/context-switching as infrequently as possible. We wake the
upper layer waiter during an RDMA completion to reduce context
switching. The first completion does not always mean there's nothing
left for the NIC to do.

Ignoring the server backchannel, there is a single unpin in the
"complete without error" case in rpcrdma_complete_rqst(). There are
three call sites for rpcrdma_complete_rqst():

1. rpcrdma_reply_done - called when no explicit invalidation is needed
   and the Receive completion fired after the Send completion.

2. rpcrdma_sendctx_done - called when no explicit invalidation is
   needed and the Send completion fired after the Receive completion.

3. frwr_wc_localinv_done - called when explicit invalidation was
   needed and that invalidation has completed. SQ ordering guarantees
   that the Send has also completed at this point, so a separate wait
   for Send completion is not needed.

These are the points when the NIC has finished with all of the memory
resources associated with an RPC, so that's when it is safe to wake
the waiter and unpin the RPC.

--
Chuck Lever
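
P.S. If it helps, here is a rough sketch of how cases 1 and 2 above can
be expressed as "whichever completion fires last finishes the RPC".
The example_* names are invented and the real xprtrdma code is
organized differently; case 3 needs no such counting because the
LOCAL_INV completion is itself the final one:

/*
 * Rough sketch only: invented example_* names, no error handling.
 * One reference per completion the NIC still owes us; whichever
 * completion drops the last reference wakes the waiter and unpins.
 */
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/spinlock.h>
#include <linux/sunrpc/xprt.h>

struct example_req {
        struct kref     remaining;      /* outstanding completions */
        struct rpc_rqst *rqst;
        int             reply_len;
};

/* At send time: expect both a Send and a Receive completion before
 * the upper layer may be awoken. */
static void example_prepare(struct example_req *req)
{
        kref_init(&req->remaining);     /* Receive completion */
        kref_get(&req->remaining);      /* Send completion */
}

static void example_last_completion(struct kref *kref)
{
        struct example_req *req =
                container_of(kref, struct example_req, remaining);
        struct rpc_xprt *xprt = req->rqst->rq_xprt;

        /* The NIC has released every buffer: wake, then unpin. */
        spin_lock(&xprt->queue_lock);
        xprt_complete_rqst(req->rqst->rq_task, req->reply_len);
        xprt_unpin_rqst(req->rqst);
        spin_unlock(&xprt->queue_lock);
}

/* Send and Receive completions can arrive in either order; only the
 * second one to arrive drops the final reference and completes. */
static void example_wc_send(struct example_req *req)
{
        kref_put(&req->remaining, example_last_completion);
}

static void example_wc_receive(struct example_req *req)
{
        kref_put(&req->remaining, example_last_completion);
}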