Re: NFS server regression in kernel 5.13 (tested w/ 5.13.9)

On Tue, 10 Aug 2021, Mike Javorski wrote:
> Managed to get the rpcdebug info for some freezes today. I had a loop
> running that called the rpcdebug command Neil provided and then slept
> for 2 seconds. Each triggering event was the opening of a new video
> file (I was shuffling through files with mpv). I haven't managed to
> get a tcp capture yet; I'm trying to get one that isn't too massive,
> but the timing is tricky, and this doesn't happen every time.
> 

Thanks for the logs.

They show consistent "a:call_status" and "q:xprt_pending".  They also
show timeouts in the range 10000-20000.  These numbers are in
'jiffies', which normally tick once per millisecond, so 10 to 20
seconds.
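To sanity-check that conversion (assuming HZ=1000, i.e. one jiffy per
millisecond; the real value depends on the kernel's CONFIG_HZ):

```shell
# Convert a 'timeout' value from the task lists below into seconds.
# HZ=1000 is an assumption here; check CONFIG_HZ on the running kernel.
jiffies=15395    # example value taken from the first log block below
HZ=1000
awk -v j="$jiffies" -v hz="$HZ" 'BEGIN { printf "%.3f\n", j/hz }'
# prints 15.395
```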

That means the RPC layer (thinks it has) sent a request (made a call)
and is waiting for a reply to arrive over the network transport (i.e.
the TCP connection).  However that would typically involve a much longer
timeout.  For NFSv3 and 4.0, 60 seconds is the norm.  For NFSv4.1 I
*think* there is no timeout as NFS never resends requests unless the TCP
connection is broken.  So I don't know what to make of that.

Given that it recovers in 5-10 seconds, it seems very likely that the
request really has been sent (and presumably the timeout doesn't
expire).  Most likely the server is being delayed in replying.  A
tcpdump trace would confirm this.

I don't think you've said anything about what NFS server you are using.
Is it Linux, or something else?  If Linux, what kernel do you run there?

One thing that might cause a delay when accessing a file is if some
other client has a 'delegation' and takes a while to return it.
e.g.
  client2 creates a file for writing
  server gives it a 'write delegation'
  client2 queues up lots of write requests
  client1 asks to read the file
  server says to client2 "please return the delegation"
  client2 starts flushing its data, but either takes too long or
   has some other hiccup; eventually the delegation is returned, or the
   server revokes it
  server tells client1 it can read now
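If a tcpdump capture does materialize, one way to check for this
scenario is to scan the tshark one-line summary for delegation
recall/return operations.  The sample lines below are illustrative
only (summary formatting varies with the Wireshark version); the grep
is the point:

```shell
# In practice you would run:
#   tshark -r /tmp/somefile.pcap | grep -E 'CB_RECALL|DELEGRETURN'
# The sample input here stands in for real tshark output.
summary='12.001 10.0.0.2 -> 10.0.0.1 NFS V4 CB Call CB_RECALL
17.432 10.0.0.2 -> 10.0.0.1 NFS V4 Call DELEGRETURN
20.113 10.0.0.3 -> 10.0.0.1 NFS V4 Call READ'
printf '%s\n' "$summary" | grep -E 'CB_RECALL|DELEGRETURN'
```

A CB_RECALL followed several seconds later by a DELEGRETURN would line
up with the delay pattern described above.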

This leads me to ask: do you have other NFS clients that might be
accessing the same file, or might have accessed it recently?

NeilBrown


> -----------------------------
> 
> [31229.965368] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31229.965370] 44224 c801      0 8c478d70 2473aa3b    15395  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31229.965411] 44225 c801      0 8c478d70 a7d3b9b8    15395  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31231.966982] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31231.966986] 44224 c801      0 8c478d70 2473aa3b    14795  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31231.967027] 44225 c801      0 8c478d70 a7d3b9b8    14795  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31233.968741] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31233.968745] 44224 c801      0 8c478d70 2473aa3b    14194  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31233.968787] 44225 c801      0 8c478d70 a7d3b9b8    14194  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> 
> -----------------------------
> 
> [31292.534529] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31292.534530] 45744 c801      0 8c478d70 2473aa3b    15340  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31292.534554] 45745 c801      0 8c478d70 50cbef7b    15342  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31294.536203] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31294.536206] 45744 c801      0 8c478d70 2473aa3b    14740  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31294.536232] 45745 c801      0 8c478d70 50cbef7b    14742  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31296.537826] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31296.537830] 45744 c801      0 8c478d70 2473aa3b    14139  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31296.537871] 45745 c801      0 8c478d70 50cbef7b    14141  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31298.539474] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31298.539478] 45744 c801      0 8c478d70 2473aa3b    13539  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31298.539522] 45745 c801      0 8c478d70 50cbef7b    13541  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31300.541040] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31300.541044] 46048 c801      0 8c478d70 50cbef7b    17965  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31300.541108] 46049 c801      0 8c478d70 2473aa3b    17965  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31302.542632] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31302.542635] 46922 c801      0 8c478d70 50cbef7b    18000  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> 
> ---------------------------
> 
> [31375.046020] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31375.046022] 47336 c801      0 8c478d70 cebaa9a6    15964  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31375.046048] 47337 c801      0 8c478d70 f7145d4d    15964  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31377.047361] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31377.047365] 47336 c801      0 8c478d70 cebaa9a6    15364  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31377.047406] 47337 c801      0 8c478d70 f7145d4d    15364  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31379.048801] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31379.048805] 47336 c801      0 8c478d70 cebaa9a6    14764  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31379.048845] 47337 c801      0 8c478d70 f7145d4d    14764  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31381.050376] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31381.050380] 47336 c801      0 8c478d70 cebaa9a6    14163  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31381.050421] 47337 c801      0 8c478d70 f7145d4d    14163  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31383.051746] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31383.051749] 47336 c801      0 8c478d70 cebaa9a6    13563  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31383.051775] 47337 c801      0 8c478d70 f7145d4d    13563  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31385.053182] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31385.053185] 47342 c801      0 8c478d70 f7145d4d    17980  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31387.054422] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31387.054426] 47818 c801      0 8c478d70 f7145d4d    17761  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31387.054467] 47819 c801      0 8c478d70 cebaa9a6    17762  8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> 
> 
> - mike
> 
> 
> On Sun, Aug 8, 2021 at 5:28 PM Mike Javorski <mike.javorski@xxxxxxxxx> wrote:
> >
> > Appreciate the insights Neil. I will make an attempt to
> > trigger/capture the fault if I can.
> >
> > - mike
> >
> > On Sun, Aug 8, 2021 at 5:01 PM NeilBrown <neilb@xxxxxxx> wrote:
> > >
> > > On Mon, 09 Aug 2021, Mike Javorski wrote:
> > > > I have been experiencing nfs file access hangs with multiple release
> > > > versions of the 5.13.x linux kernel. In each case, all file transfers
> > > > freeze for 5-10 seconds and then resume. This seems worse when reading
> > > > through many files sequentially.
> > >
> > > A particularly useful debugging tool for NFS freezes is to run
> > >
> > >   rpcdebug -m rpc -c all
> > >
> > > while the system appears frozen.  As you only have a 5-10 second window
> > > this might be tricky.
> > > Setting or clearing debug flags in the rpc module (whether they are
> > > already set or not) has the side effect of listing all RPC "tasks"
> > > which are waiting for a reply.  Seeing that task list can often be useful.
> > >
> > > The task list appears in "dmesg" output.  If there are no tasks
> > > waiting, nothing will be written, which might lead you to think it
> > > didn't work.
> > >
> > > As Chuck hinted, tcpdump is invaluable for this sort of problem.
> > >   tcpdump -s 0 -w /tmp/somefile.pcap port 2049
> > >
> > > will capture NFS traffic.  If this can start before a hang, and finish
> > > after, it may contain useful information.  Doing that in a way that
> > > doesn't create an enormous file might be a challenge.  It would help if
> > > you found a way trigger the problem.  Take note of the circumstances
> > > when it seems to happen the most.  If you can only produce a large file,
> > > we can probably still work with it.
> > >   tshark -r /tmp/somefile.pcap
> > > will report the capture one line per packet.  You can look for the
> > > appropriate timestamp, note the frame numbers, and use "editcap"
> > > to extract a suitable range of packets.
> > >
> > > NeilBrown
> > >
> > >
> > > >
> > > > My server:
> > > > - Archlinux w/ a distribution provided kernel package
> > > > - filesystems exported with "rw,sync,no_subtree_check,insecure" options
> > > >
> > > > Client:
> > > > - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> > > > - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> > > > (ver=4.2 is indicated in mount)
> > > >
> > > > I have tried the 5.13.x kernel several times since the first arch
> > > > release (most recently with 5.13.9-arch1-1), all with similar results.
> > > > Each time, I am forced to downgrade the linux package to a 5.12.x
> > > > kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> > > > and stabilize performance. No other changes are made between tests. I
> > > > have confirmed the freezing behavior using both ext4 and btrfs
> > > > filesystems exported from this server.
> > > >
> > > > At this point I would appreciate some guidance in what to provide in
> > > > order to diagnose and resolve this issue. I don't have a lot of kernel
> > > > debugging experience, so instruction would be helpful.
> > > >
> > > > - mike
> > > >
> > > >
> 
> 



