On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
> On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> > On 4/29/22 9:37 AM, Chuck Lever III wrote:
> > > 
> > > > On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
> > > > <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > 
> > > > On 4/28/22 3:56 PM, Trond Myklebust wrote:
> > > > > On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
> > > > > > On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> > > > > > > On 4/28/22 10:57 AM, Chuck Lever III wrote:
> > > > > > > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
> > > > > > > > > <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > > > > 
> > > > > > > > > Hi NFS folks,
> > > > > > > > > 
> > > > > > > > > I've noticed a pretty nasty regression in our NFS
> > > > > > > > > capability between 5.17 and 5.18-rc1. I've tried to
> > > > > > > > > bisect but am not having any luck. The problem I'm
> > > > > > > > > seeing is that it takes 3 minutes to copy a file from
> > > > > > > > > NFS to the local disk, when it should take less than
> > > > > > > > > half a second, which it did up through 5.17.
> > > > > > > > > 
> > > > > > > > > It doesn't seem to be network related, but I can't
> > > > > > > > > rule that out completely.
> > > > > > > > > 
> > > > > > > > > I tried to bisect, but the problem can be
> > > > > > > > > intermittent. Some runs I'll see a problem in 3 out
> > > > > > > > > of 100 cycles, sometimes 0 out of 100. Sometimes I'll
> > > > > > > > > see it 100 out of 100.
> > > > > > > > 
> > > > > > > > It's not clear from your problem report whether the
> > > > > > > > problem appears when it's the server running v5.18-rc
> > > > > > > > or the client.
> > > > > > > 
> > > > > > > That's because I don't know which it is. I'll do a quick
> > > > > > > test and find out. I was testing the same kernel across
> > > > > > > both nodes.
> > > > > > 
> > > > > > Looks like it is the client.
> > > > > > 
> > > > > > server    client    result
> > > > > > ------    ------    ------
> > > > > > 5.17      5.17      Pass
> > > > > > 5.17      5.18      Fail
> > > > > > 5.18      5.18      Fail
> > > > > > 5.18      5.17      Pass
> > > > > > 
> > > > > > Is there a patch for the client issue you mentioned that I
> > > > > > could try?
> > > > > > 
> > > > > > -Denny
> > > > > 
> > > > > Try this one
> > > > 
> > > > Thanks for the patch. Unfortunately it doesn't seem to solve
> > > > the issue; I still see intermittent hangs. I applied it on top
> > > > of -rc4:
> > > > 
> > > > copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
> > > > 
> > > > real    2m6.072s
> > > > user    0m0.002s
> > > > sys     0m0.263s
> > > > Done
> > > > 
> > > > While it was hung I checked the mem usage on the machine:
> > > > 
> > > > # free -h
> > > >               total        used        free      shared  buff/cache   available
> > > > Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
> > > > Swap:         4.0Gi          0B       4.0Gi
> > > > 
> > > > Doesn't appear to be under memory pressure.
> > > 
> > > Hi, since you know now that it is the client, perhaps a bisect
> > > would be more successful?
> > 
> > I've been testing all week. I pulled the nfs-rdma tree that was
> > sent to Linus for 5.18 and tested. I see the problem on pretty much
> > all the patches; however, it's the frequency with which it hits
> > that changes.
> > 
> > I'll see 1-5 cycles out of 2500 where the copy takes minutes, up to:
> > "NFS: Convert readdir page cache to use a cookie based index"
> > 
> > After this I start seeing it around 10 times in 500, and by the
> > last patch 10 times in less than 100.
> > 
> > Is there any kind of tracing/debugging I could turn on to get more
> > insight into what is taking so long when it does go bad?
> 
> Ran a test with -rc6 and this time see a hung task trace on the
> console as well as an NFS RPC error.
> 
> [32719.991175] nfs: RPC call returned error 512
> .
> .
> .
> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
> [32933.293543]       Tainted: G S                5.18.0-rc6 #1
> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141 ppid:     2 flags:0x00004000
> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [32933.324524] Call Trace:
> [32933.327347]  <TASK>
> [32933.329785]  __schedule+0x3dd/0x970
> [32933.333783]  schedule+0x41/0xa0
> [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
> [32933.343639]  ? prepare_to_wait+0xd0/0xd0
> [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
> [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
> [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
> [32933.376564]  ? __switch_to_asm+0x42/0x70
> [32933.381046]  ? finish_task_switch+0xb2/0x2c0
> [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
> [32933.391391]  process_one_work+0x1c8/0x390
> [32933.395975]  worker_thread+0x30/0x360
> [32933.400162]  ? process_one_work+0x390/0x390
> [32933.404931]  kthread+0xd9/0x100
> [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
> [32933.413984]  ret_from_fork+0x22/0x30
> [32933.418074]  </TASK>
> 
> The call trace shows up again at 245, 368, and 491 seconds. Same
> task, same trace.

That's very helpful. The above trace suggests that the RDMA code is
leaking a call to xprt_unpin_rqst().

From what I can see, rpcrdma_reply_handler() will always call
xprt_pin_rqst(), and that call is required to be matched by a call to
xprt_unpin_rqst() before the handler exits. However, those calls to
xprt_unpin_rqst() appear to be scattered around the code, and it is
not at all obvious that the requirement is being fulfilled.

Chuck, why does that code need to be so complex?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
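For illustration, here is a minimal sketch of the pin/unpin discipline
described above. It is not the actual rpcrdma_reply_handler() code;
example_reply_handler() is a made-up name, and the locking simply
mirrors the pattern the generic sunrpc transports use around
xprt_lookup_rqst()/xprt_pin_rqst():

#include <linux/sunrpc/xprt.h>

/*
 * Sketch only: once xprt_pin_rqst() has been called, every path out
 * of the handler must reach a matching xprt_unpin_rqst().  A missed
 * unpin leaves xprt_request_dequeue_xprt() sleeping indefinitely,
 * which is what the blocked rpciod worker in the trace above is doing.
 */
static void example_reply_handler(struct rpc_xprt *xprt, __be32 xid,
				  int copied)
{
	struct rpc_rqst *rqst;

	spin_lock(&xprt->queue_lock);
	rqst = xprt_lookup_rqst(xprt, xid);
	if (!rqst) {
		/* nothing was pinned, so nothing to unpin */
		spin_unlock(&xprt->queue_lock);
		return;
	}
	xprt_pin_rqst(rqst);
	spin_unlock(&xprt->queue_lock);

	/*
	 * ... parse the reply without holding queue_lock; any error
	 * detected here must still fall through to the unpin below ...
	 */

	spin_lock(&xprt->queue_lock);
	xprt_complete_rqst(rqst->rq_task, copied);
	xprt_unpin_rqst(rqst);
	spin_unlock(&xprt->queue_lock);
}

The pin keeps the rqst from being released while the handler is still
using it, and xprt_request_dequeue_xprt() waits for the pin count to
drop to zero, so a single early return or error path that skips
xprt_unpin_rqst() is enough to produce the stuck kworker shown in the
hung-task trace.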