On Fri, 2022-05-13 at 14:59 +0000, Chuck Lever III wrote:
> > On May 13, 2022, at 9:30 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
> > > On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> > > > On 4/29/22 9:37 AM, Chuck Lever III wrote:
> > > > > > On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > > On 4/28/22 3:56 PM, Trond Myklebust wrote:
> > > > > > > On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
> > > > > > > > On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> > > > > > > > > On 4/28/22 10:57 AM, Chuck Lever III wrote:
> > > > > > > > > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Hi NFS folks,
> > > > > > > > > > > 
> > > > > > > > > > > I've noticed a pretty nasty regression in our NFS capability between 5.17 and 5.18-rc1. I've tried to bisect but am not having any luck. The problem I'm seeing is that it takes 3 minutes to copy a file from NFS to the local disk, when it should take less than half a second, which it did up through 5.17.
> > > > > > > > > > > 
> > > > > > > > > > > It doesn't seem to be network related, but I can't rule that out completely.
> > > > > > > > > > > 
> > > > > > > > > > > I tried to bisect but the problem can be intermittent. Some runs I'll see a problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it 100 out of 100.
> > > > > > > > > > 
> > > > > > > > > > It's not clear from your problem report whether the problem appears when it's the server running v5.18-rc or the client.
> > > > > > > > > 
> > > > > > > > > That's because I don't know which it is. I'll do a quick test and find out. I was testing the same kernel across both nodes.
> > > > > > > > 
> > > > > > > > Looks like it is the client.
> > > > > > > > 
> > > > > > > > server   client   result
> > > > > > > > ------   ------   ------
> > > > > > > > 5.17     5.17     Pass
> > > > > > > > 5.17     5.18     Fail
> > > > > > > > 5.18     5.18     Fail
> > > > > > > > 5.18     5.17     Pass
> > > > > > > > 
> > > > > > > > Is there a patch for the client issue you mentioned that I could try?
> > > > > > > > 
> > > > > > > > -Denny
> > > > > > > 
> > > > > > > Try this one
> > > > > > 
> > > > > > Thanks for the patch. Unfortunately it doesn't seem to solve the issue; I still see intermittent hangs. I applied it on top of -rc4:
> > > > > > 
> > > > > > copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
> > > > > > 
> > > > > > real    2m6.072s
> > > > > > user    0m0.002s
> > > > > > sys     0m0.263s
> > > > > > Done
> > > > > > 
> > > > > > While it was hung I checked the mem usage on the machine:
> > > > > > 
> > > > > > # free -h
> > > > > >               total        used        free      shared  buff/cache   available
> > > > > > Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
> > > > > > Swap:         4.0Gi          0B       4.0Gi
> > > > > > 
> > > > > > Doesn't appear to be under memory pressure.
> > > > > 
> > > > > Hi, since you know now that it is the client, perhaps a bisect would be more successful?
> > > > 
> > > > I've been testing all week. I pulled the nfs-rdma tree that was sent to Linus for 5.18 and tested. I see the problem on pretty much all the patches; however, it's the frequency with which it hits that changes.
> > > > 
> > > > I'll see 1-5 cycles out of 2500 where the copy takes minutes, up to:
> > > > "NFS: Convert readdir page cache to use a cookie based index"
> > > > 
> > > > After this I start seeing it around 10 times in 500, and by the last patch 10 times in less than 100.
> > > > 
> > > > Is there any kind of tracing/debugging I could turn on to get more insight into what is taking so long when it does go bad?
> > > 
> > > Ran a test with -rc6 and this time saw a hung task trace on the console as well as an NFS RPC error.
> > > 
> > > [32719.991175] nfs: RPC call returned error 512
> > > .
> > > .
> > > .
> > > [32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
> > > [32933.293543]       Tainted: G S                5.18.0-rc6 #1
> > > [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141 ppid:     2 flags:0x00004000
> > > [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > > [32933.324524] Call Trace:
> > > [32933.327347]  <TASK>
> > > [32933.329785]  __schedule+0x3dd/0x970
> > > [32933.333783]  schedule+0x41/0xa0
> > > [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
> > > [32933.343639]  ? prepare_to_wait+0xd0/0xd0
> > > [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> > > [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
> > > [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> > > [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
> > > [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
> > > [32933.376564]  ? __switch_to_asm+0x42/0x70
> > > [32933.381046]  ? finish_task_switch+0xb2/0x2c0
> > > [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
> > > [32933.391391]  process_one_work+0x1c8/0x390
> > > [32933.395975]  worker_thread+0x30/0x360
> > > [32933.400162]  ? process_one_work+0x390/0x390
> > > [32933.404931]  kthread+0xd9/0x100
> > > [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
> > > [32933.413984]  ret_from_fork+0x22/0x30
> > > [32933.418074]  </TASK>
> > > 
> > > The call trace shows up again at 245, 368, and 491 seconds. Same task, same trace.
> > 
> > That's very helpful. The above trace suggests that the RDMA code is leaking a call to xprt_unpin_rqst().
> 
> IMHO this is unlikely to be related to the performance regression -- none of this code has changed in the past 5 kernel releases. Could be a different issue, though.
> 
> As is often the case in these situations, the INFO trace above happens long after the issue that caused the missing unpin. So... unless Dennis has a reproducer that can trigger the issue frequently, I don't think there's much that can be extracted from that.
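
The pin/unpin contract being discussed has roughly the following shape. This is a minimal sketch using the generic sunrpc helpers; the handler name and its arguments are hypothetical (it is not the real rpcrdma_reply_handler()), and error handling is elided:

#include <linux/sunrpc/xprt.h>

/*
 * Sketch only: the reply path pins the rpc_rqst so that xprt_release()
 * cannot tear it down while the transport still owns its buffers, and
 * the completion path must drop that pin exactly once, after
 * xprt_complete_rqst().
 */
static void example_reply_handler(struct rpc_xprt *xprt, __be32 xid, int copied)
{
	struct rpc_rqst *rqst;

	spin_lock(&xprt->queue_lock);
	rqst = xprt_lookup_rqst(xprt, xid);
	if (!rqst) {
		spin_unlock(&xprt->queue_lock);
		return;
	}
	xprt_pin_rqst(rqst);		/* keep rqst alive across the unlock */
	spin_unlock(&xprt->queue_lock);

	/* ... the NIC may still be touching rqst's buffers here ... */

	spin_lock(&xprt->queue_lock);
	xprt_complete_rqst(rqst->rq_task, copied);
	xprt_unpin_rqst(rqst);		/* must balance the pin above */
	spin_unlock(&xprt->queue_lock);
}

If any exit path skips that final xprt_unpin_rqst(), xprt_release() ends up waiting forever in xprt_wait_on_pinned_rqst(), which is what a worker blocked in xprt_request_dequeue_xprt(), as in the trace above, looks like.
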
> 
> Also "nfs: RPC call returned error 512" suggests someone hit ^C at some point. It's always possible that the xprt_rdma_free() path is missing an unpin. But again, that's not likely to be related to performance.
> 
> > From what I can see, rpcrdma_reply_handler() will always call xprt_pin_rqst(), and that call is required to be matched by a call to xprt_unpin_rqst() before the handler exits. However, those calls to xprt_unpin_rqst() appear to be scattered around the code, and it is not at all obvious that the requirement is being fulfilled.
> 
> I don't think unpin is required before rpcrdma_reply_handler() exits. It /is/ required after xprt_complete_rqst(), and sometimes that call has to be deferred until the NIC is fully done with the RPC's buffers. The NIC is often not finished with the buffers in the Receive completion handler.
> 
> > Chuck, why does that code need to be so complex?
> 
> The overall goal is that the upper layer cannot be awoken until the NIC has finished with the RPC's header and payload buffers. These buffers are treated differently depending on how the send path chose to marshal the RPC.
> 
> And all of these are hot paths, so it's important to ensure that we are waiting/context-switching as infrequently as possible. We wake the upper layer waiter during an RDMA completion to reduce context switching. The first completion does not always mean there's nothing left for the NIC to do.
> 
> Ignoring the server backchannel, there is a single unpin in the "complete without error" case in rpcrdma_complete_rqst().
> 
> There are three call sites for rpcrdma_complete_rqst():
> 
> 1. rpcrdma_reply_done - called when no explicit invalidation is needed and the Receive completion fired after the Send completion.
> 
> 2. rpcrdma_sendctx_done - called when no explicit invalidation is needed and the Send completion fired after the Receive completion.
> 
> 3. frwr_wc_localinv_done - called when explicit invalidation was needed and that invalidation has completed. SQ ordering guarantees that the Send has also completed at this point, so a separate wait for Send completion is not needed.
> 
> These are the points when the NIC has finished with all of the memory resources associated with an RPC, so that's when it is safe to wake the waiter and unpin the RPC.

From what I can see, what's important in the case of RDMA is to ensure that we call exactly one of frwr_unmap_async(), frwr_unmap_sync(), or rpcrdma_complete_rqst(), and that whichever operation we select must complete before we call xprt->ops->free_slot(). xprt_pin_rqst()/xprt_unpin_rqst() can help to ensure exclusion between those three operations, but it is not required if there are other ways to do so. In fact, since xprt_wait_on_pinned_rqst() forces a synchronous sleep in xprt_release(), it would be much preferable to find a better solution than to pin the request if your concern is performance.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
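
As an illustration of the "whichever completion fires last wakes the waiter" ordering described above, here is a minimal, hypothetical sketch using a per-request event count. The struct and function names are invented for the example; the real xprtrdma code tracks completion differently and also handles errors and the backchannel:

#include <linux/atomic.h>
#include <linux/spinlock.h>
#include <linux/sunrpc/xprt.h>

/* Hypothetical per-RPC state for this sketch only. */
struct example_rdma_req {
	struct rpc_rqst	*rqst;
	atomic_t	 events;	/* Send + Receive (+ LocalInv) still outstanding */
	int		 status;
};

/*
 * Called from each relevant completion handler (Send, Receive, or
 * Local Invalidate).  Only the final completion wakes the upper layer
 * and drops the pin, so the waiter cannot run, and the slot cannot be
 * freed, while the NIC still owns any of the RPC's buffers.
 */
static void example_event_done(struct example_rdma_req *req)
{
	struct rpc_xprt *xprt = req->rqst->rq_xprt;

	if (!atomic_dec_and_test(&req->events))
		return;			/* the NIC is not finished yet */

	spin_lock(&xprt->queue_lock);
	xprt_complete_rqst(req->rqst->rq_task, req->status);
	xprt_unpin_rqst(req->rqst);	/* single unpin, single wake-up */
	spin_unlock(&xprt->queue_lock);
}

In such a scheme the count would be armed once when the request is posted, sized to the number of completions the chosen marshalling strategy will generate, which is what guarantees that exactly one path performs the completion and the unpin.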