Re: NFS regression between 5.17 and 5.18

On Tue, Jun 21, 2022 at 12:58 PM Dennis Dalessandro
<dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On 6/21/22 12:04 PM, Olga Kornievskaia wrote:
> > Hi Dennis,
> >
> > Can I ask some basic questions? Have you tried to get any kinds of
> > profiling done to see where the client is spending time (using perf
> > perhaps)?
> >
> > real    4m11.835s
> > user    0m0.001s
> > sys     0m0.277s
> >
> > sounds like ~4 minutes are spent sleeping somewhere? Did it take 4mins to do
> > a network transfer (if we had a network trace we could see how long
> > network transfer were)? Do you have one (that goes along with
> > something that can tell us approximately when the request began from
> > the cp's perspective, like a date before hand)?
> >
> > I see that there were no rdma changes that went into the 5.18 kernel,
> > so whatever changed is either generic nfs behaviour or perhaps
> > something in the rdma core code (is a Mellanox card being used here?)
> >
> > I wonder if the slowdown only happens on rdma or is it visible on the
> > tcp mount as well, have you tried?
> >
>
> Hi Olga,
>
> I have opened a Kernel Bugzilla if you would rather log future responses there:
> https://bugzilla.kernel.org/show_bug.cgi?id=216160
>
> To answer your above questions: This is on Omni-Path hardware. I have not tried
> the TCP mount, but I can. I don't have any network trace per se or a profile.
> We don't support something like a tcpdump or anything like that. However I can
> tell you there is nothing going over the network while it appears to be hung. I
> can monitor the packet counters.

In this thread there are two problems raised: (1) a performance
regression, and (2) a single run that hit a hung client.

For #1, given that there were no rdma changes added to 5.18, it seems
like something in the generic nfs code is causing issues for you, so I
recommend first using a linux profiler to get some information about
the time spent in kernel functions triggered by the cp command. If you
can't run a profiler (which I think you should be able to), then
perhaps just enable the nfs4 and rpcrdma tracepoints, which also carry
timestamps; looking at the differences can give some clue where the
time is being spent. 4min is a significant chunk of time and should be
visible somewhere in those timestamps.
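For example, something along these lines (just a sketch; the mount
point /mnt/nfs and file names are placeholders for whatever your test
actually copies, and you can use ftrace directly instead of trace-cmd):

  # system-wide profile while the copy runs
  perf record -a -g -- cp /mnt/nfs/srcfile /tmp/dstfile
  perf report

  # or capture the nfs4/rpcrdma tracepoints, which include timestamps
  trace-cmd record -e nfs4 -e rpcrdma -e sunrpc \
      cp /mnt/nfs/srcfile /tmp/dstfile
  trace-cmd report > /tmp/cp-trace.txt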

For #2, I have personally run into that stack trace while
investigating a hang using soft iWARP as the rdma provider. It was a
request stuck waiting to be unpinned, and I think it was due to a soft
iWARP failure that caused it not to deliver a completion to rdma,
which led to the request never getting unpinned. Thus I would
recommend looking for failures in your rdma provider for clues on that
problem.
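If it turns out to be the same kind of thing, the kernel log and the
per-port error counters are a quick first place to look (a rough
sketch; counter names vary by provider, so adjust for Omni-Path/hfi1):

  dmesg | grep -iE 'rpcrdma|rdma|hfi1|cqe|completion'
  grep . /sys/class/infiniband/*/ports/*/counters/*error*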

> If you have some ideas where I could put some trace points that could tell us
> something I can certainly add those.
>
> -Denny
>
> >
> >
> > On Mon, Jun 20, 2022 at 1:06 PM Dennis Dalessandro
> > <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> On 6/20/22 10:40 AM, Chuck Lever III wrote:
> >>> Hi Thorsten-
> >>>
> >>>> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 20.06.22 16:11, Chuck Lever III wrote:
> >>>>>
> >>>>>
> >>>>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Dennis, Chuck, I have below issue on the list of tracked regressions.
> >>>>>> What's the status? Has any progress been made? Or is this not really a
> >>>>>> regression and can be ignored?
> >>>>>>
> >>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>>>>>
> >>>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> >>>>>> reports and sometimes miss something important when writing mails like
> >>>>>> this. If that's the case here, don't hesitate to tell me in a public
> >>>>>> reply, it's in everyone's interest to set the public record straight.
> >>>>>>
> >>>>>> #regzbot poke
> >>>>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
> >>>>>
> >>>>> The above link points to an Apple trackpad bug.
> >>>>
> >>>> Yeah, I know, sorry, should have mentioned: either I or my bot did
> >>>> something stupid and associated that report with this regression, that's
> >>>> why I deassociated it with the "unlink" command.
> >>>
> >>> Is there an open bugzilla for the original regression?
> >>>
> >>>
> >>>>> The bug described all the way at the bottom was the origin problem
> >>>>> report. I believe this is an NFS client issue. We are waiting for
> >>>>> a response from the NFS client maintainers to help Dennis track
> >>>>> this down.
> >>>>
> >>>> Many thx for the status update. Can anything be done to speed things up?
> >>>> This is taking quite a long time already -- way longer than outlined in
> >>>> "Prioritize work on fixing regressions" here:
> >>>> https://docs.kernel.org/process/handling-regressions.html
> >>>
> >>> ENOTMYMONKEYS ;-)
> >>>
> >>> I was involved to help with the ^C issue that happened while
> >>> Dennis was troubleshooting. It's not related to the original
> >>> regression, which needs to be pursued by the NFS client
> >>> maintainers.
> >>>
> >>> The correct people to poke are Trond, Olga (both cc'd) and
> >>> Anna Schumaker.
> >>
> >> Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a
> >> result of the test we were running taking too long. It times out after 10
> >> minutes or so and kills the process. So a downstream effect of the regression.
> >>
> >> The test is still failing as of 5.19-rc2. I'll double check that it's
> >> the same issue and open a bugzilla.
> >>
> >> Thanks for poking at this.
> >>
> >> -Denny


