Thanks for the pointer, Chuck. I ran the trace capture and was able to
trigger two freezes in relatively short order. Here is that trace file:

https://drive.google.com/file/d/1qk_VIMkj8aTeQIg5O0W3AvWyeDSWL3vW/view?usp=sharing

- mike

On Mon, Aug 16, 2021 at 6:21 AM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
> > On Aug 15, 2021, at 9:20 PM, NeilBrown <neilb@xxxxxxx> wrote:
> >
> > On Sun, 15 Aug 2021, Mike Javorski wrote:
> >> I managed to get a capture with several discrete freezes in it, and I
> >> included a chunk with 5 of them in the span of ~3000 frames. I added
> >> packet comments at each frame that the tshark command reported as a
> >> >1 sec RPC wait. Just search for "Freeze" in (wire|t)shark in the
> >> packet details. This is with kernel 5.13.10 as provided by Arch (see
> >> https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1
> >> for the diff vs. mainline; nothing NFS/RPC-related that I can identify).
> >>
> >> I tried unsuccessfully to get any failures with the 5.12.15 kernel.
> >>
> >> https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing
> >>
> >> The file should be downloadable anonymously.
> >
> > Got it, thanks.
> >
> > There are 3 RPC replies that arrive more than 1 second after their
> > requests. The replies are in frames 983, 1005, and 1777. These roughly
> > correspond to where you added the "Freeze" annotations (I didn't know
> > you could do that!).
> >
> > 983:
> > This is the (start of the) reply to a READ request in frame 980.
> > It is a 128K read. The whole reply has arrived by frame 1004, 2ms
> > later.
> > The request (frame 980) is followed 13ms later by a TCP retransmit
> > of the request, and then (20usec later) by a TCP ACK from the server.
> >
> > The fact that the client needed to retransmit seems a little odd,
> > but as it is the only retransmit in the whole capture, I don't think
> > we can read much into it.
> >
> > 1005:
> > This is the reply to a 128K READ request in frame 793 - earlier than
> > the previous one.
> > So there were two READ requests, then a 2-second delay, then both
> > replies in reverse order.
> >
> > 1777:
> > This is a similar READ reply - to 1761.
> > There were READs in 1760, 1761, and 1775:
> > 1760 is replied to almost immediately,
> > 1761 gets a reply in 4 seconds (1777),
> > 1775 never gets a reply (in the available packet capture).
> >
> > Looking at other delays ... most READs get a reply in under a
> > millisecond. There are about 20 where the reply takes more than 1ms -
> > the longest delay not already mentioned is 90ms, with reply 1857.
> > The pattern here is:
> > READ req (1)
> > GETATTR req
> > GETATTR reply
> > READ req (2)
> > READ reply (1)
> > pause
> > READ reply (2)
> >
> > I suspect this is the same problem occurring, but it isn't so
> > noticeable.
> >
> > My first thought was that the reply might be getting stuck in the TCP
> > transmit queue on the server, but checking the TSval in the TCP
> > timestamp option shows that - for frame 983, which shows a 2-second
> > delay - the TSval is also 2 seconds later than the previous packet.
> > So the delay happens before the TCP-level decision to create the
> > packet.
> >
> > So I cannot see any real evidence to suggest a TCP-level problem.
> > And delays of 2 or 4 seconds - maybe even 90ms - seem unlikely to be
> > caused by an nfsd problem.
> >
> > So my guess is that the delay comes from the filesystem. Maybe.
> > What filesystem are you exporting?
> >
> > How can we check this? Probably by turning on nfsd tracing.
> > There are a bunch of tracepoints related to reading:
> >
> > trace_nfsd_read_start
> > trace_nfsd_read_done
> > trace_nfsd_read_io_done
> > trace_nfsd_read_err
> > trace_nfsd_read_splice
> > trace_nfsd_read_vector
> >
> > Enabling them might be useful, as you should be able to see whether
> > the delay was within one read request or between two read requests.
> > But I don't have much (if any) experience with enabling tracepoints.
> > I really should try that some day. Maybe you can find guidance on
> > using these tracepoints somewhere ... or maybe you already know how :-)

> Try something like:
>
> # trace-cmd record -e nfsd:nfsd_read_\*
>
> When the test is done, ^C the trace-cmd program. The trace records
> are stored in a file called trace.dat. You can view them with:
>
> # trace-cmd report | less
>
> The trace.dat file is portable. It carries the format specifiers
> for all recorded events, so the trace records can be viewed on
> other systems.
>
>
> --
> Chuck Lever
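Neil's question above (is the delay within one read request, or between two?)
can be answered by post-processing the trace-cmd output. The following is only
a sketch, not something from the thread: the `read_latency` helper name is
invented, and it assumes the default `trace-cmd report` line format (a
"<seconds>.<microseconds>:" timestamp field and an "xid=0x..." argument on
each nfsd read event):

```shell
# Sketch: pair each nfsd_read_start with its nfsd_read_done by RPC xid
# and print the elapsed time, flagging reads slower than one second.
# Assumes default "trace-cmd report" formatting (see note above).
read_latency() {
    awk '
    {
        # Pick out the timestamp field (e.g. "1234.567890:") and the xid.
        ts = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+:$/) ts = $i + 0
        if (!match($0, /xid=0x[0-9a-fA-F]+/)) next
        xid = substr($0, RSTART, RLENGTH)
    }
    /nfsd_read_start/ { start[xid] = ts }
    /nfsd_read_done/ && (xid in start) {
        delta = ts - start[xid]
        printf "%s %.6fs%s\n", xid, delta, (delta > 1 ? "  <-- slow" : "")
        delete start[xid]
    }'
}

# Usage:  trace-cmd report | read_latency
```

If a freeze shows up as a large start-to-done gap here, the stall is inside
the server's read path (supporting the filesystem theory); if every read
completes quickly despite the slow RPC reply on the wire, the delay sits
between requests instead.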