Re: NFS regression between 5.17 and 5.18

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 6/21/22 12:04 PM, Olga Kornievskaia wrote:
> Hi Dennis,
> 
> Can I ask some basic questions? Have you tried to get any kinds of
> profiling done to see where the client is spending time (using perf
> perhaps)?
> 
> real    4m11.835s
> user    0m0.001s
> sys     0m0.277s
> 
> sounds like 4ms are spent sleeping somewhere? Did it take 4mins to do
> a network transfer (if we had a network trace we could see how long
> network transfer were)? Do you have one (that goes along with
> something that can tell us approximately when the request began from
> the cp's perspective, like a date before hand)?
> 
> I see that there were no rdma changes that went into 5.18 kernel so
> whatever changed either a generic nfs behaviour or perhaps something
> in the rdma core code (is an mellonax card being used here?)
> 
> I wonder if the slowdown only happens on rdma or is it visible on the
> tcp mount as well, have you tried?
> 

Hi Olga,

I have opened a Kernel Bugzilla if you would rather log future responses there:
https://bugzilla.kernel.org/show_bug.cgi?id=216160

To answer your above questions: This is on Omni-Path hardware. I have not tried
the TCP mount, I can though. I don't have any network trace per-se or a profile.
We don't support like a TCP dump or anything like that. However I can tell you
there is nothing going over the network while it appears to be hung. I can
monitor the packet counters.

If you have some ideas where I could put some trace points that could tell us
something I can certainly add those.

-Denny

> 
> 
> On Mon, Jun 20, 2022 at 1:06 PM Dennis Dalessandro
> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On 6/20/22 10:40 AM, Chuck Lever III wrote:
>>> Hi Thorsten-
>>>
>>>> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> wrote:
>>>>
>>>> On 20.06.22 16:11, Chuck Lever III wrote:
>>>>>
>>>>>
>>>>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Dennis, Chuck, I have below issue on the list of tracked regressions.
>>>>>> What's the status? Has any progress been made? Or is this not really a
>>>>>> regression and can be ignored?
>>>>>>
>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>
>>>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
>>>>>> reports and sometimes miss something important when writing mails like
>>>>>> this. If that's the case here, don't hesitate to tell me in a public
>>>>>> reply, it's in everyone's interest to set the public record straight.
>>>>>>
>>>>>> #regzbot poke
>>>>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
>>>>>
>>>>> The above link points to an Apple trackpad bug.
>>>>
>>>> Yeah, I know, sorry, should have mentioned: either I or my bot did
>>>> something stupid and associated that report with this regression, that's
>>>> why I deassociated it with the "unlink" command.
>>>
>>> Is there an open bugzilla for the original regression?
>>>
>>>
>>>>> The bug described all the way at the bottom was the origin problem
>>>>> report. I believe this is an NFS client issue. We are waiting for
>>>>> a response from the NFS client maintainers to help Dennis track
>>>>> this down.
>>>>
>>>> Many thx for the status update. Can anything be done to speed things up?
>>>> This is taken quite a long time already -- way longer that outlined in
>>>> "Prioritize work on fixing regressions" here:
>>>> https://docs.kernel.org/process/handling-regressions.html
>>>
>>> ENOTMYMONKEYS ;-)
>>>
>>> I was involved to help with the ^C issue that happened while
>>> Dennis was troubleshooting. It's not related to the original
>>> regression, which needs to be pursued by the NFS client
>>> maintainers.
>>>
>>> The correct people to poke are Trond, Olga (both cc'd) and
>>> Anna Schumaker.
>>
>> Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a
>> result of the test we were running taking too long. It times out after 10
>> minutes or so and kills the process. So a downstream effect of the regression.
>>
>> The test is still continuing to fail as of 5.19-rc2. I'll double check that it's
>> the same issue and open a bugzilla.
>>
>> Thanks for poking at this.
>>
>> -Denny



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux