Re: NFS regression between 5.17 and 5.18

> On May 13, 2022, at 9:30 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
>> On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
>>> On 4/29/22 9:37 AM, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
>>>>> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>>>>> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi NFS folks,
>>>>>>>>>> 
>>>>>>>>>> I've noticed a pretty nasty regression in our NFS
>>>>>>>>>> capability
>>>>>>>>>> between 5.17 and
>>>>>>>>>> 5.18-rc1. I've tried to bisect but not having any
>>>>>>>>>> luck. The
>>>>>>>>>> problem I'm seeing
>>>>>>>>>> is it takes 3 minutes to copy a file from NFS to the
>>>>>>>>>> local
>>>>>>>>>> disk. When it should
>>>>>>>>>> take less than half a second, which it did up through
>>>>>>>>>> 5.17.
>>>>>>>>>> 
>>>>>>>>>> It doesn't seem to be network related, but can't rule
>>>>>>>>>> that out
>>>>>>>>>> completely.
>>>>>>>>>> 
>>>>>>>>>> I tried to bisect but the problem can be
>>>>>>>>>> intermittent. Some
>>>>>>>>>> runs I'll see a
>>>>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of
>>>>>>>>>> 100.
>>>>>>>>>> Sometimes I'll see it
>>>>>>>>>> 100 out of 100.
>>>>>>>>> 
>>>>>>>>> It's not clear from your problem report whether the
>>>>>>>>> problem
>>>>>>>>> appears
>>>>>>>>> when it's the server running v5.18-rc or the client.
>>>>>>>> 
>>>>>>>> That's because I don't know which it is. I'll do a quick
>>>>>>>> test and
>>>>>>>> find out. I
>>>>>>>> was testing the same kernel across both nodes.
>>>>>>> 
>>>>>>> Looks like it is the client.
>>>>>>> 
>>>>>>> server  client  result
>>>>>>> ------  ------  ------
>>>>>>> 5.17    5.17    Pass
>>>>>>> 5.17    5.18    Fail
>>>>>>> 5.18    5.18    Fail
>>>>>>> 5.18    5.17    Pass
>>>>>>> 
>>>>>>> Is there a patch for the client issue you mentioned that I
>>>>>>> could try?
>>>>>>> 
>>>>>>> -Denny
>>>>>> 
>>>>>> Try this one
>>>>> 
>>>>> Thanks for the patch. Unfortunately it doesn't seem to solve
>>>>> the issue, still
>>>>> see intermittent hangs. I applied it on top of -rc4:
>>>>> 
>>>>> copy /mnt/nfs_test/run_nfs_test.junk to
>>>>> /dev/shm/run_nfs_test.tmp...
>>>>> 
>>>>> real    2m6.072s
>>>>> user    0m0.002s
>>>>> sys     0m0.263s
>>>>> Done
>>>>> 
>>>>> While it was hung I checked the mem usage on the machine:
>>>>> 
>>>>> # free -h
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
>>>>> Swap:         4.0Gi          0B       4.0Gi
>>>>> 
>>>>> Doesn't appear to be under memory pressure.
>>>> 
>>>> Hi, since you know now that it is the client, perhaps a bisect
>>>> would be more successful?
>>> 
>>> I've been testing all week. I pulled the nfs-rdma tree that was
>>> sent to Linus
>>> for 5.18 and tested. I see the problem on pretty much all the
>>> patches. However
>>> it's the frequency that it hits which changes.
>>> 
>>> I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
>>> "NFS: Convert readdir page cache to use a cookie based index"
>>> 
>>> After this I start seeing it around 10 times in 500 and by the last patch
>>> 10 times in less than 100.
>>> 
>>> Is there any kind of tracing/debugging I could turn on to get more
>>> insight on
>>> what is taking so long when it does go bad?
>>> 
>> 
>> Ran a test with -rc6 and this time see a hung task trace on the
>> console as well
>> as an NFS RPC error.
>> 
>> [32719.991175] nfs: RPC call returned error 512
>> .
>> .
>> .
>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
>> [32933.293543]       Tainted: G S                5.18.0-rc6 #1
>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141 ppid:     2 flags:0x00004000
>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>> [32933.324524] Call Trace:
>> [32933.327347]  <TASK>
>> [32933.329785]  __schedule+0x3dd/0x970
>> [32933.333783]  schedule+0x41/0xa0
>> [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>> [32933.343639]  ? prepare_to_wait+0xd0/0xd0
>> [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>> [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
>> [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>> [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
>> [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
>> [32933.376564]  ? __switch_to_asm+0x42/0x70
>> [32933.381046]  ? finish_task_switch+0xb2/0x2c0
>> [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
>> [32933.391391]  process_one_work+0x1c8/0x390
>> [32933.395975]  worker_thread+0x30/0x360
>> [32933.400162]  ? process_one_work+0x390/0x390
>> [32933.404931]  kthread+0xd9/0x100
>> [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
>> [32933.413984]  ret_from_fork+0x22/0x30
>> [32933.418074]  </TASK>
>> 
>> The call trace shows up again at 245, 368, and 491 seconds. Same
>> task, same trace.
>> 
>> 
>> 
>> 
> 
> That's very helpful. The above trace suggests that the RDMA code is
> leaking a call to xprt_unpin_rqst().

IMHO this is unlikely to be related to the performance
regression -- none of this code has changed in the past 5
kernel releases. Could be a different issue, though.

As is often the case in these situations, the INFO trace
above happens long after the issue that caused the missing
unpin. So... unless Dennis has a reproducer that can trigger
the issue frequently, I don't think there's much that can
be extracted from that.

Also "nfs: RPC call returned error 512" suggests someone
hit ^C at some point. It's always possible that the
xprt_rdma_free() path is missing an unpin. But again,
that's not likely to be related to performance.


> From what I can see, rpcrdma_reply_handler() will always call
> xprt_pin_rqst(), and that call is required to be matched by a call to
> xprt_unpin_rqst() before the handler exits. However those calls to
> xprt_unpin_rqst() appear to be scattered around the code, and it is not
> at all obvious that the requirement is being fulfilled.

I don't think unpin is required before rpcrdma_reply_handler()
exits. It /is/ required after xprt_complete_rqst(), and
sometimes that call has to be deferred until the NIC is
fully done with the RPC's buffers. The NIC is often not
finished with the buffers in the Receive completion handler.
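
To make the pairing concrete, here is roughly the shape of it (a
simplified sketch, not the exact upstream code: the two helper
names are invented for illustration, and the RDMA-specific
dispatching and error paths are left out):

#include <linux/sunrpc/xprt.h>

/* Receive handler side: look up and pin the rqst so it cannot be
 * released while the reply is still being processed. The lookup
 * and the pin must be atomic, hence the queue_lock. */
static struct rpc_rqst *reply_find_and_pin(struct rpc_xprt *xprt,
                                           __be32 xid)
{
        struct rpc_rqst *rqst;

        spin_lock(&xprt->queue_lock);
        rqst = xprt_lookup_rqst(xprt, xid);
        if (rqst)
                xprt_pin_rqst(rqst);
        spin_unlock(&xprt->queue_lock);
        return rqst;
}

/* Completion side: runs only once the NIC is done with the RPC's
 * header and payload buffers. This is the one place the matching
 * unpin happens, right after xprt_complete_rqst(). */
static void reply_complete_and_unpin(struct rpc_xprt *xprt,
                                     struct rpc_rqst *rqst, int copied)
{
        spin_lock(&xprt->queue_lock);
        xprt_complete_rqst(rqst->rq_task, copied);
        xprt_unpin_rqst(rqst);
        spin_unlock(&xprt->queue_lock);
}

The hard part is deciding when the second half is allowed to run,
which is what the rest of this discussion is about.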


> Chuck, why does that code need to be so complex?

The overall goal is that the upper layer cannot be awoken
until the NIC has finished with the RPC's header and
payload buffers. These buffers are treated differently
depending on how the send path chose to marshal the RPC.

And, all of these are hot paths, so it's important to
ensure that we are waiting/context-switching as
infrequently as possible. We wake the upper layer waiter
during an RDMA completion to reduce context switching.
The first completion does not always mean there's nothing
left for the NIC to do.

Ignoring the server backchannel, there is a single unpin
in the "complete without error" case in
rpcrdma_complete_rqst().

There are three call sites for rpcrdma_complete_rqst:

1. rpcrdma_reply_done - called when no explicit invalidation
is needed and the Receive completion fired after the Send
completion.

2. rpcrdma_sendctx_done - called when no explicit invalidation
is needed and the Send completion fired after the Receive
completion.

3. frwr_wc_localinv_done - called when explicit invalidation
was needed and that invalidation has completed. SQ ordering
guarantees that the Send has also completed at this point,
so a separate wait for Send completion is not needed.

These are the points when the NIC has finished with all of
the memory resources associated with an RPC, so that's
when it is safe to wake the waiter and unpin the RPC.
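
If it helps to see the shape of that "last completion wins" rule,
here is a minimal sketch (hedged: upstream xprtrdma does this with
a kref embedded in the rpcrdma_req; the refcount_t and the names
below are invented for illustration only):

#include <linux/refcount.h>

/* One reference for the Send completion and one for the Receive
 * completion. The completion path that drops the count to zero is
 * the only one allowed to wake the waiter and unpin the rqst. */
struct rpc_done_ref {
        refcount_t outstanding;
};

static void rpc_done_init(struct rpc_done_ref *d)
{
        refcount_set(&d->outstanding, 2);       /* Send + Receive */
}

/* Called from each completion handler; returns true only for the
 * completion that happens to fire last. */
static bool rpc_done_put(struct rpc_done_ref *d)
{
        return refcount_dec_and_test(&d->outstanding);
}

In those terms, rpcrdma_reply_done() and rpcrdma_sendctx_done() are
the "put that fired last" for cases 1 and 2, and in case 3 the
LocalInv completion stands in for the Send completion because of
SQ ordering.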

--
Chuck Lever






