On Fri, 2022-05-13 at 14:59 +0000, Chuck Lever III wrote:
> > On May 13, 2022, at 9:30 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
> > > On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> > > > On 4/29/22 9:37 AM, Chuck Lever III wrote:
> > > > > > On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > > On 4/28/22 3:56 PM, Trond Myklebust wrote:
> > > > > > > On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
> > > > > > > > On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> > > > > > > > > On 4/28/22 10:57 AM, Chuck Lever III wrote:
> > > > > > > > > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Hi NFS folks,
> > > > > > > > > > > 
> > > > > > > > > > > I've noticed a pretty nasty regression in our NFS capability between 5.17 and 5.18-rc1. I've tried to bisect but am not having any luck. The problem I'm seeing is that it takes 3 minutes to copy a file from NFS to the local disk, when it should take less than half a second, which it did up through 5.17.
> > > > > > > > > > > 
> > > > > > > > > > > It doesn't seem to be network related, but I can't rule that out completely.
> > > > > > > > > > > 
> > > > > > > > > > > I tried to bisect but the problem can be intermittent. Some runs I'll see a problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it 100 out of 100.
> > > > > > > > > > 
> > > > > > > > > > It's not clear from your problem report whether the problem appears when it's the server running v5.18-rc or the client.
> > > > > > > > > 
> > > > > > > > > That's because I don't know which it is. I'll do a quick test and find out. I was testing the same kernel across both nodes.
> > > > > > > > 
> > > > > > > > Looks like it is the client.
> > > > > > > > 
> > > > > > > > server   client   result
> > > > > > > > ------   ------   ------
> > > > > > > > 5.17     5.17     Pass
> > > > > > > > 5.17     5.18     Fail
> > > > > > > > 5.18     5.18     Fail
> > > > > > > > 5.18     5.17     Pass
> > > > > > > > 
> > > > > > > > Is there a patch for the client issue you mentioned that I could try?
> > > > > > > > 
> > > > > > > > -Denny
> > > > > > > 
> > > > > > > Try this one
> > > > > > 
> > > > > > Thanks for the patch. Unfortunately it doesn't seem to solve the issue; I still see intermittent hangs. I applied it on top of -rc4:
> > > > > > 
> > > > > > copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
> > > > > > 
> > > > > > real    2m6.072s
> > > > > > user    0m0.002s
> > > > > > sys     0m0.263s
> > > > > > Done
> > > > > > 
> > > > > > While it was hung I checked the mem usage on the machine:
> > > > > > 
> > > > > > # free -h
> > > > > >               total        used        free      shared  buff/cache   available
> > > > > > Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
> > > > > > Swap:         4.0Gi          0B       4.0Gi
> > > > > > 
> > > > > > Doesn't appear to be under memory pressure.
> > > > > 
> > > > > Hi, since you know now that it is the client, perhaps a bisect would be more successful?
> > > > 
> > > > I've been testing all week. I pulled the nfs-rdma tree that was sent to Linus for 5.18 and tested. I see the problem on pretty much all the patches; however, it's the frequency with which it hits that changes.
> > > > 
> > > > I'll see 1-5 cycles out of 2500 where the copy takes minutes, up to:
> > > > "NFS: Convert readdir page cache to use a cookie based index"
> > > > 
> > > > After this I start seeing it around 10 times in 500, and by the last patch 10 times in less than 100.
> > > > 
> > > > Is there any kind of tracing/debugging I could turn on to get more insight into what is taking so long when it does go bad?
> > > 
> > > Ran a test with -rc6 and this time saw a hung task trace on the console as well as an NFS RPC error.
> > > 
> > > [32719.991175] nfs: RPC call returned error 512
> > > .
> > > .
> > > .
> > > [32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
> > > [32933.293543]       Tainted: G S                5.18.0-rc6 #1
> > > [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141 ppid:     2 flags:0x00004000
> > > [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > > [32933.324524] Call Trace:
> > > [32933.327347]  <TASK>
> > > [32933.329785]  __schedule+0x3dd/0x970
> > > [32933.333783]  schedule+0x41/0xa0
> > > [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
> > > [32933.343639]  ? prepare_to_wait+0xd0/0xd0
> > > [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> > > [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
> > > [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> > > [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
> > > [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
> > > [32933.376564]  ? __switch_to_asm+0x42/0x70
> > > [32933.381046]  ? finish_task_switch+0xb2/0x2c0
> > > [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
> > > [32933.391391]  process_one_work+0x1c8/0x390
> > > [32933.395975]  worker_thread+0x30/0x360
> > > [32933.400162]  ? process_one_work+0x390/0x390
> > > [32933.404931]  kthread+0xd9/0x100
> > > [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
> > > [32933.413984]  ret_from_fork+0x22/0x30
> > > [32933.418074]  </TASK>
> > > 
> > > The call trace shows up again at 245, 368, and 491 seconds. Same task, same trace.
> > 
> > That's very helpful. The above trace suggests that the RDMA code is leaking a call to xprt_unpin_rqst().
> 
> IMHO this is unlikely to be related to the performance regression -- none of this code has changed in the past 5 kernel releases. Could be a different issue, though.
> 
> As is often the case in these situations, the INFO trace above happens long after the issue that caused the missing unpin. So... unless Dennis has a reproducer that can trigger the issue frequently, I don't think there's much that can be extracted from that.
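
The pin/unpin contract being discussed has roughly the following shape. This is a minimal sketch using the generic sunrpc helpers; the handler name and its arguments are hypothetical (it is not the real rpcrdma_reply_handler()), and error handling is elided:

#include <linux/sunrpc/xprt.h>

/*
 * Sketch only: the reply path pins the rpc_rqst so that xprt_release()
 * cannot tear it down while the transport still owns its buffers, and
 * the completion path must drop that pin exactly once, after
 * xprt_complete_rqst().
 */
static void example_reply_handler(struct rpc_xprt *xprt, __be32 xid, int copied)
{
	struct rpc_rqst *rqst;

	spin_lock(&xprt->queue_lock);
	rqst = xprt_lookup_rqst(xprt, xid);
	if (!rqst) {
		spin_unlock(&xprt->queue_lock);
		return;
	}
	xprt_pin_rqst(rqst);		/* keep rqst alive across the unlock */
	spin_unlock(&xprt->queue_lock);

	/* ... the NIC may still be touching rqst's buffers here ... */

	spin_lock(&xprt->queue_lock);
	xprt_complete_rqst(rqst->rq_task, copied);
	xprt_unpin_rqst(rqst);		/* must balance the pin above */
	spin_unlock(&xprt->queue_lock);
}

If any exit path skips that final xprt_unpin_rqst(), xprt_release() ends up waiting forever in xprt_wait_on_pinned_rqst(), which is what a worker blocked in xprt_request_dequeue_xprt(), as in the trace above, looks like.
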
> 
> Also "nfs: RPC call returned error 512" suggests someone hit ^C at some point. It's always possible that the xprt_rdma_free() path is missing an unpin. But again, that's not likely to be related to performance.
> 
> > From what I can see, rpcrdma_reply_handler() will always call xprt_pin_rqst(), and that call is required to be matched by a call to xprt_unpin_rqst() before the handler exits. However, those calls to xprt_unpin_rqst() appear to be scattered around the code, and it is not at all obvious that the requirement is being fulfilled.
> 
> I don't think unpin is required before rpcrdma_reply_handler() exits. It /is/ required after xprt_complete_rqst(), and sometimes that call has to be deferred until the NIC is fully done with the RPC's buffers. The NIC is often not finished with the buffers in the Receive completion handler.
> 
> > Chuck, why does that code need to be so complex?
> 
> The overall goal is that the upper layer cannot be awoken until the NIC has finished with the RPC's header and payload buffers. These buffers are treated differently depending on how the send path chose to marshal the RPC.
> 
> And all of these are hot paths, so it's important to ensure that we are waiting/context-switching as infrequently as possible. We wake the upper layer waiter during an RDMA completion to reduce context switching. The first completion does not always mean there's nothing left for the NIC to do.
> 
> Ignoring the server backchannel, there is a single unpin in the "complete without error" case in rpcrdma_complete_rqst().
> 
> There are three call sites for rpcrdma_complete_rqst():
> 
> 1. rpcrdma_reply_done - called when no explicit invalidation is needed and the Receive completion fired after the Send completion.
> 
> 2. rpcrdma_sendctx_done - called when no explicit invalidation is needed and the Send completion fired after the Receive completion.
> 
> 3. frwr_wc_localinv_done - called when explicit invalidation was needed and that invalidation has completed. SQ ordering guarantees that the Send has also completed at this point, so a separate wait for Send completion is not needed.
> 
> These are the points when the NIC has finished with all of the memory resources associated with an RPC, so that's when it is safe to wake the waiter and unpin the RPC.

From what I can see, what's important in the case of RDMA is to ensure that we call exactly one of frwr_unmap_async(), frwr_unmap_sync(), or rpcrdma_complete_rqst(), and that whichever operation we select must complete before we call xprt->ops->free_slot(). xprt_pin_rqst()/xprt_unpin_rqst() can help to ensure exclusion between those three operations, but it is not required if there are other ways to do so. In fact, since xprt_wait_on_pinned_rqst() forces a synchronous sleep in xprt_release(), it would be much preferable to find a better solution than to pin the request if your concern is performance.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
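
As an illustration of the "whichever completion fires last wakes the waiter" ordering described above, here is a minimal, hypothetical sketch using a per-request event count. The struct and function names are invented for the example; the real xprtrdma code tracks completion differently and also handles errors and the backchannel:

#include <linux/atomic.h>
#include <linux/spinlock.h>
#include <linux/sunrpc/xprt.h>

/* Hypothetical per-RPC state for this sketch only. */
struct example_rdma_req {
	struct rpc_rqst	*rqst;
	atomic_t	 events;	/* Send + Receive (+ LocalInv) still outstanding */
	int		 status;
};

/*
 * Called from each relevant completion handler (Send, Receive, or
 * Local Invalidate).  Only the final completion wakes the upper layer
 * and drops the pin, so the waiter cannot run, and the slot cannot be
 * freed, while the NIC still owns any of the RPC's buffers.
 */
static void example_event_done(struct example_rdma_req *req)
{
	struct rpc_xprt *xprt = req->rqst->rq_xprt;

	if (!atomic_dec_and_test(&req->events))
		return;			/* the NIC is not finished yet */

	spin_lock(&xprt->queue_lock);
	xprt_complete_rqst(req->rqst->rq_task, req->status);
	xprt_unpin_rqst(req->rqst);	/* single unpin, single wake-up */
	spin_unlock(&xprt->queue_lock);
}

In such a scheme the count would be armed once when the request is posted, sized to the number of completions the chosen marshalling strategy will generate, which is what guarantees that exactly one path performs the completion and the unpin.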