Re: NFS regression between 5.17 and 5.18

> On May 13, 2022, at 11:19 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2022-05-13 at 14:59 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On May 13, 2022, at 9:30 AM, Trond Myklebust
>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
>>>> On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
>>>>> On 4/29/22 9:37 AM, Chuck Lever III wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
>>>>>>> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>>>>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro
>>>>>>>> wrote:
>>>>>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>>>>>>> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi NFS folks,
>>>>>>>>>>>> 
>>>>>>>>>>>> I've noticed a pretty nasty regression in our NFS
>>>>>>>>>>>> capability between 5.17 and 5.18-rc1. I've tried to
>>>>>>>>>>>> bisect but am not having any luck. The problem I'm
>>>>>>>>>>>> seeing is that it takes 3 minutes to copy a file from
>>>>>>>>>>>> NFS to the local disk, when it should take less than
>>>>>>>>>>>> half a second, which it did up through 5.17.
>>>>>>>>>>>>
>>>>>>>>>>>> It doesn't seem to be network related, but I can't
>>>>>>>>>>>> rule that out completely.
>>>>>>>>>>>>
>>>>>>>>>>>> I tried to bisect, but the problem can be
>>>>>>>>>>>> intermittent. Some runs I'll see a problem in 3 out of
>>>>>>>>>>>> 100 cycles, sometimes 0 out of 100. Sometimes I'll see
>>>>>>>>>>>> it 100 out of 100.
>>>>>>>>>>> 
>>>>>>>>>>> It's not clear from your problem report whether the
>>>>>>>>>>> problem appears when it's the server running v5.18-rc
>>>>>>>>>>> or the client.
>>>>>>>>>> 
>>>>>>>>>> That's because I don't know which it is. I'll do a
>>>>>>>>>> quick test and find out. I was testing the same kernel
>>>>>>>>>> across both nodes.
>>>>>>>>> 
>>>>>>>>> Looks like it is the client.
>>>>>>>>> 
>>>>>>>>> server  client  result
>>>>>>>>> ------  ------  ------
>>>>>>>>> 5.17    5.17    Pass
>>>>>>>>> 5.17    5.18    Fail
>>>>>>>>> 5.18    5.18    Fail
>>>>>>>>> 5.18    5.17    Pass
>>>>>>>>> 
>>>>>>>>> Is there a patch for the client issue you mentioned that
>>>>>>>>> I could try?
>>>>>>>>> 
>>>>>>>>> -Denny
>>>>>>>> 
>>>>>>>> Try this one
>>>>>>> 
>>>>>>> Thanks for the patch. Unfortunately it doesn't seem to
>>>>>>> solve the issue; I still see intermittent hangs. I applied
>>>>>>> it on top of -rc4:
>>>>>>> 
>>>>>>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>>>>>>> 
>>>>>>> real    2m6.072s
>>>>>>> user    0m0.002s
>>>>>>> sys     0m0.263s
>>>>>>> Done
>>>>>>> 
>>>>>>> While it was hung I checked the mem usage on the machine:
>>>>>>> 
>>>>>>> # free -h
>>>>>>>               total        used        free      shared  buff/cache   available
>>>>>>> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
>>>>>>> Swap:         4.0Gi          0B       4.0Gi
>>>>>>> 
>>>>>>> Doesn't appear to be under memory pressure.
>>>>>> 
>>>>>> Hi, since you know now that it is the client, perhaps a
>>>>>> bisect would be more successful?
>>>>> 
>>>>> I've been testing all week. I pulled the nfs-rdma tree that
>>>>> was sent to Linus for 5.18 and tested. I see the problem on
>>>>> pretty much all the patches; however, it's the frequency at
>>>>> which it hits that changes.
>>>>> 
>>>>> I'll see 1-5 cycles out of 2500 where the copy takes minutes,
>>>>> up to:
>>>>> "NFS: Convert readdir page cache to use a cookie based index"
>>>>>
>>>>> After this I start seeing it around 10 times in 500, and by
>>>>> the last patch 10 times in less than 100.
>>>>> 
>>>>> Is there any kind of tracing/debugging I could turn on to get
>>>>> more insight into what is taking so long when it does go bad?
>>>>> 
>>>> 
>>>> Ran a test with -rc6 and this time see a hung task trace on
>>>> the console as well as an NFS RPC error.
>>>> 
>>>> [32719.991175] nfs: RPC call returned error 512
>>>> .
>>>> .
>>>> .
>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
>>>> [32933.293543]       Tainted: G S                5.18.0-rc6 #1
>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141 ppid:     2 flags:0x00004000
>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>> [32933.324524] Call Trace:
>>>> [32933.327347]  <TASK>
>>>> [32933.329785]  __schedule+0x3dd/0x970
>>>> [32933.333783]  schedule+0x41/0xa0
>>>> [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>>>> [32933.343639]  ? prepare_to_wait+0xd0/0xd0
>>>> [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>> [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
>>>> [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>> [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
>>>> [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
>>>> [32933.376564]  ? __switch_to_asm+0x42/0x70
>>>> [32933.381046]  ? finish_task_switch+0xb2/0x2c0
>>>> [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
>>>> [32933.391391]  process_one_work+0x1c8/0x390
>>>> [32933.395975]  worker_thread+0x30/0x360
>>>> [32933.400162]  ? process_one_work+0x390/0x390
>>>> [32933.404931]  kthread+0xd9/0x100
>>>> [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
>>>> [32933.413984]  ret_from_fork+0x22/0x30
>>>> [32933.418074]  </TASK>
>>>> 
>>>> The call trace shows up again at 245, 368, and 491 seconds. Same
>>>> task, same trace.
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> That's very helpful. The above trace suggests that the RDMA code is
>>> leaking a call to xprt_unpin_rqst().
>> 
>> IMHO this is unlikely to be related to the performance
>> regression -- none of this code has changed in the past 5
>> kernel releases. Could be a different issue, though.
>> 
>> As is often the case in these situations, the INFO trace
>> above happens long after the issue that caused the missing
>> unpin. So... unless Dennis has a reproducer that can trigger
>> the issue frequently, I don't think there's much that can
>> be extracted from that.
>> 
>> Also "nfs: RPC call returned error 512" suggests someone
>> hit ^C at some point. It's always possible that the
>> xprt_rdma_free() path is missing an unpin. But again,
>> that's not likely to be related to performance.
>> 
>> 
>>> From what I can see, rpcrdma_reply_handler() will always call
>>> xprt_pin_rqst(), and that call is required to be matched by a
>>> call to xprt_unpin_rqst() before the handler exits. However,
>>> those calls to xprt_unpin_rqst() appear to be scattered around
>>> the code, and it is not at all obvious that the requirement is
>>> being fulfilled.
>> 
>> I don't think unpin is required before rpcrdma_reply_handler()
>> exits. It /is/ required after xprt_complete_rqst(), and
>> sometimes that call has to be deferred until the NIC is
>> fully done with the RPC's buffers. The NIC is often not
>> finished with the buffers in the Receive completion handler.
>> 
>> 
>>> Chuck, why does that code need to be so complex?
>> 
>> The overall goal is that the upper layer cannot be awoken
>> until the NIC has finished with the RPC's header and
>> payload buffers. These buffers are treated differently
>> depending on how the send path chose to marshal the RPC.
>> 
>> And, all of these are hot paths, so it's important to
>> ensure that we are waiting/context-switching as
>> infrequently as possible. We wake the upper layer waiter
>> during an RDMA completion to reduce context switching.
>> The first completion does not always mean there's nothing
>> left for the NIC to do.
>> 
>> Ignoring the server backchannel, there is a single unpin in the
>> "complete without error" case in rpcrdma_complete_rqst().
>> 
>> There are three call sites for rpcrdma_complete_rqst:
>> 
>> 1. rpcrdma_reply_done - called when no explicit invalidation
>> is needed and the Receive completion fired after the Send
>> completion.
>> 
>> 2. rpcrdma_sendctx_done - called when no explicit invalidation
>> is needed and the Send completion fired after the Receive
>> completion.
>> 
>> 3. frwr_wc_localinv_done - called when explicit invalidation
>> was needed and that invalidation has completed. SQ ordering
>> guarantees that the Send has also completed at this point,
>> so a separate wait for Send completion is not needed.
>> 
>> These are the points when the NIC has finished with all of
>> the memory resources associated with an RPC, so that's
>> when it is safe to wake the waiter and unpin the RPC.
>> 
> 
> From what I can see, what's important in the case of RDMA is to ensure
> that we call exactly 1 of either frwr_unmap_async(), frwr_unmap_sync()
> or rpcrdma_complete_rqst(),

No, all of these mechanisms are supposed to end up at
rpcrdma_complete_rqst(). That's where the RPC/RDMA transport
header is parsed.

Reply completion processing is kicked off by a Receive completion.
If local invalidation is unneeded, then we have to wait explicitly
for the Send completion too. Otherwise, a LOCAL_INV is posted and
we wait for that to complete; that completion means the previous
Send is complete, so waiting for Send completion in that case is
unnecessary.

When all of that dust settles, rpcrdma_complete_rqst() is called
to wake up the upper layer. That's basically how xprtsock did it
when I wrote all this years ago.
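
To make that concrete, here is a stripped-down userspace sketch of
the "whoever completes last does the wake-up" idea. This is purely
illustrative -- the names and the bare atomic counter are mine, not
the actual xprtrdma code, which tracks the outstanding work a bit
differently:

/*
 * Each RPC starts with one reference per DMA-visible resource the
 * NIC still owns (the Receive, plus either the Send or the
 * LOCAL_INV). Each completion handler drops one reference; the
 * handler that drops the last one wakes the upper layer, so no
 * extra context switch is ever added.
 */
#include <stdatomic.h>
#include <stdio.h>

struct fake_rqst {
	atomic_int pending;	/* outstanding hardware completions */
};

static void complete_rqst(struct fake_rqst *req)
{
	/* stand-in for rpcrdma_complete_rqst(): parse the transport
	 * header and wake the waiting upper-layer task */
	(void)req;
	printf("upper layer woken\n");
}

/* called from the Receive completion handler */
static void receive_done(struct fake_rqst *req)
{
	if (atomic_fetch_sub(&req->pending, 1) == 1)
		complete_rqst(req);
}

/* called from the Send (or LOCAL_INV) completion handler */
static void send_done(struct fake_rqst *req)
{
	if (atomic_fetch_sub(&req->pending, 1) == 1)
		complete_rqst(req);
}

int main(void)
{
	struct fake_rqst req = { .pending = 2 };

	/* the two completions can arrive in either order */
	send_done(&req);
	receive_done(&req);	/* last one in does the wake-up */
	return 0;
}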

(The Send completion is needed because the NIC is free to continue
reading from sndbuf until then. We can't release that buffer until
the Send completes, because otherwise a NIC retransmission could
expose different contents on the wire, or worse, the receiving end
would see garbage. It works a bit like TCP retransmission.)
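
In code terms the ownership rule for sndbuf is simply this -- again
an illustrative userspace sketch with made-up names, not the real
send path:

#include <stdio.h>
#include <stdlib.h>

struct send_ctx {
	void *sndbuf;		/* DMA-visible RPC Call buffer */
};

/* posting the Send hands sndbuf to the hardware; do NOT free here */
static void post_send(struct send_ctx *sc)
{
	printf("Send posted, sndbuf %p still owned by the NIC\n",
	       sc->sndbuf);
}

/* runs only when the Send completion arrives: the NIC is done reading */
static void send_completion(struct send_ctx *sc)
{
	free(sc->sndbuf);
	sc->sndbuf = NULL;
	printf("Send complete, sndbuf released\n");
}

int main(void)
{
	struct send_ctx sc = { .sndbuf = malloc(1024) };

	post_send(&sc);
	/* ... a retransmission may still read sndbuf here ... */
	send_completion(&sc);
	return 0;
}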


> and that whichever operation we select must
> complete before we call xprt->ops->free_slot().

The NIC has to tell xprtrdma that its memory operations are
complete before RPC can XDR decode either the RPC header or
the NFS request itself. That's well before ->free_slot(), IIRC.


> xprt_pin_rqst()/xprt_unpin_rqst() can help to ensure exclusion
> between those 3 operations, but it is not required if there are
> other ways to do so.
> In fact, since xprt_wait_on_pinned_rqst() forces a synchronous
> sleep in xprt_release(), it would be much preferable to find a
> better solution than to pin the request if your concern is
> performance.

In particular, my performance concern is to avoid waits and
context switches, because those add latency, sometimes quite a
bit. I've tried to build the completion process on only the
unavoidable context switches. I don't immediately see how to
do this by adding wait_on_pinned to the fray.
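
For contrast, here's a toy illustration (my names, not sunrpc code)
of the shape a wait_on_pinned-style release path takes: the
releasing task has to go to sleep and be woken again, which is
exactly the extra context switch I'd like to keep off the hot path:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t unpinned = PTHREAD_COND_INITIALIZER;
static int pin_count = 1;	/* request still pinned by the reply handler */

/* completion side: drop the pin and wake any sleeping releaser */
static void *reply_handler(void *arg)
{
	(void)arg;
	usleep(1000);		/* the hardware completion arrives later */
	pthread_mutex_lock(&lock);
	pin_count--;
	pthread_cond_broadcast(&unpinned);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, reply_handler, NULL);

	/* release side: forced into a synchronous sleep/wake pair */
	pthread_mutex_lock(&lock);
	while (pin_count > 0)
		pthread_cond_wait(&unpinned, &lock);
	pthread_mutex_unlock(&lock);
	printf("request released after synchronous wait\n");

	pthread_join(t, NULL);
	return 0;
}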

frwr_unmap_sync() is used only to deal with ^C, so it can be
ugly and/or slow.

I am open to discussion about the structure of this code. The
simpler the better, but there are lots of moving parts. So
feedback is always welcome.


--
Chuck Lever
