On 4/29/22 9:37 AM, Chuck Lever III wrote:
> 
> 
>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>> 
>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> Hi NFS folks,
>>>>>>> 
>>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>>> between 5.17 and 5.18-rc1. I've tried to bisect but am not
>>>>>>> having any luck. The problem I'm seeing is that it takes
>>>>>>> 3 minutes to copy a file from NFS to the local disk, when it
>>>>>>> should take less than half a second, which it did up
>>>>>>> through 5.17.
>>>>>>> 
>>>>>>> It doesn't seem to be network related, but I can't rule that
>>>>>>> out completely.
>>>>>>> 
>>>>>>> I tried to bisect, but the problem is intermittent. In some
>>>>>>> runs I'll see the problem in 3 out of 100 cycles, sometimes
>>>>>>> in 0 out of 100, and sometimes in 100 out of 100.
>>>>>> 
>>>>>> It's not clear from your problem report whether the problem
>>>>>> appears when it's the server running v5.18-rc or the client.
>>>>> 
>>>>> That's because I don't know which it is. I'll do a quick test
>>>>> and find out. I was testing the same kernel across both nodes.
>>>> 
>>>> Looks like it is the client.
>>>> 
>>>> server  client  result
>>>> ------  ------  ------
>>>> 5.17    5.17    Pass
>>>> 5.17    5.18    Fail
>>>> 5.18    5.18    Fail
>>>> 5.18    5.17    Pass
>>>> 
>>>> Is there a patch for the client issue you mentioned that I could
>>>> try?
>>>> 
>>>> -Denny
>>> 
>>> Try this one
>> 
>> Thanks for the patch. Unfortunately it doesn't seem to solve the
>> issue; I still see intermittent hangs. I applied it on top of -rc4:
>> 
>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>> 
>> real    2m6.072s
>> user    0m0.002s
>> sys     0m0.263s
>> Done
>> 
>> While it was hung I checked the memory usage on the machine:
>> 
>> # free -h
>>           total     used     free   shared  buff/cache  available
>> Mem:       62Gi    871Mi     61Gi    342Mi       889Mi       61Gi
>> Swap:     4.0Gi       0B    4.0Gi
>> 
>> It doesn't appear to be under memory pressure.
> 
> Hi, since you know now that it is the client, perhaps a bisect
> would be more successful?

I've been testing all week. I pulled the nfs-rdma tree that was sent to
Linus for 5.18 and tested. I see the problem on pretty much all the
patches; what changes is how frequently it hits. I see 1-5 cycles out
of 2500 where the copy takes minutes, up to:

"NFS: Convert readdir page cache to use a cookie based index"

After this patch I start seeing it around 10 times in 500, and by the
last patch, 10 times in fewer than 100.

Is there any kind of tracing/debugging I could turn on to get more
insight into what is taking so long when it does go bad?

-Denny
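
On the tracing question, one option on the client is the kernel's NFS
and SUNRPC tracepoints (plus the rpcrdma events, since this setup uses
the RDMA transport). A minimal sketch, assuming trace-cmd is installed;
which event subsystems are available can vary by kernel version:

  # record client-side RPC activity around a single copy;
  # trace-cmd stops recording when the command exits
  trace-cmd record -e nfs4 -e sunrpc -e rpcrdma \
      cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
  trace-cmd report > trace.txt

  # alternatively, the legacy dprintk interface logs to the kernel
  # ring buffer (dmesg/syslog); very verbose, so turn it off after
  rpcdebug -m nfs -s all
  rpcdebug -m rpc -s all
  # ... reproduce the slow copy, then clear the flags:
  rpcdebug -m nfs -c all
  rpcdebug -m rpc -c all

Comparing timestamps in the trace around the stall should show whether
the client is waiting on the transport, the server, or its own page
cache handling.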
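
Given how intermittent the failure is, the bisect itself can be
scripted so that "git bisect run" drives the reproducer. A sketch of
such a script; the paths, cycle count, and 5-second threshold are
illustrative, not taken from the report:

  #!/bin/sh
  # run the copy repeatedly; declare this kernel "bad" if any single
  # run blows past the threshold
  SRC=/mnt/nfs_test/run_nfs_test.junk
  DST=/dev/shm/run_nfs_test.tmp
  for i in $(seq 1 100); do
      start=$(date +%s)
      cp "$SRC" "$DST"
      elapsed=$(( $(date +%s) - start ))
      if [ "$elapsed" -gt 5 ]; then
          echo "cycle $i took ${elapsed}s"
          exit 1    # git bisect run treats a nonzero exit as "bad"
      fi
  done
  exit 0

Invoked as "git bisect run ./copy_test.sh" (a hypothetical script
name). Since a genuinely bad commit can still pass 100 cycles by luck,
a higher cycle count trades bisect time for confidence in each step.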