Re: NFS server regression in kernel 5.13 (tested w/ 5.13.9)

"NeilBrown" <neilb@xxxxxxx> · Fri, 20 Aug 2021 10:31:22 +1000

On Tue, 17 Aug 2021, NeilBrown wrote:
> On Tue, 17 Aug 2021, Mike Javorski wrote:
> > Thanks for the pointer Chuck
> > 
> > I ran the trace capture and was able to trigger two freezes in
> > relatively short order. Here is that trace file:
> > https://drive.google.com/file/d/1qk_VIMkj8aTeQIg5O0W3AvWyeDSWL3vW/view?usp=sharing
> > 
> 
> There are gaps of 5.3sec, 4sec, 1sec and 1sec between adjacent trace
> points.
> The longest gap between 'start' and 'done' for any given xid is 354msec.
> So it doesn't look like the filesystem read causing the problem.
> 
> The long gaps between adjacent records are:
> 0.354581 xid=0x4c6ac2b6
> 0.418340 xid=0x6a71c2b6
> 1.013260 xid=0x6f71c2b6
> 1.020742 xid=0x0f71c2b6
> 4.064527 xid=0x6171c2b6
> 5.396559 xid=0xd66dc2b6
> 
> The fact 1, 1, and 4 are so close to a precise number of seconds seems
> unlikely to be a co-incidence.  It isn't clear what it might mean
> though.
> 
> I don't have any immediae ideas where to look next.  I'll let you know
> if I come up with something.

I had separate bug report
  https://bugzilla.suse.com/show_bug.cgi?id=1189508
with broadly similar symptoms which strongly points to network-layer
problem.  So I'm digging back into that tcpdump.

The TCP ACK for READ requests usually goes out in 100 or 200
microseconds.  Sometimes longer - by a factor of about 100.

% tshark -r /tmp/javmorin-nfsfreeze-5.13.10-cap1\ .test.pcap  -d tcp.port==2049,rpc -Y 'tcp.port==2049 && rpc.msgtyp==0 && nfs.opcode==25' -T fields -e tcp.seq -e frame.time_epoch -e frame.number | sed 's/$/ READ/' > /tmp/read-times
% tshark -r /tmp/javmorin-nfsfreeze-5.13.10-cap1\ .test.pcap  -d tcp.port==2049,rpc -Y 'tcp.srcport==2049' -T fields -e tcp.ack -e frame.time_epoch -e frame.number | sed 's/$/ ACK/' > /tmp/ack-times
% sort -n /tmp/read-times /tmp/ack-times  | awk '$4 == "ACK" && readtime {printf "%f %d\n", ($2 - readtime), $3; readtime=0} $4 == "READ" {readtime=$2}' | sort -n | tail -4
0.000360 441
0.012777 982
0.013318 1086
0.042356 1776

Three times (out of 245) the TCP ACK was 12 milliseconds or longer.  The
first and last were times when the reply was delayed by a couple of
seconds at least.  The other (1086) was the reply to 1085 - no obvious
delay to the READ reply.

These unusual delays (short though they are) suggest something odd in
the network layer - particularly as they are sometimes followed by a
much larger delay in a READ reply.

It might be as simple as a memory allocation delay.  It might be
something deep in the networking layer.

If you still have (or could generate) some larger tcp dumps, you could
try that code and see if 5.12 shows any ACK delays, and if 5.13 shows
more delays than turn into READ delays.  IF 5.13 easily shows a few ACK
delays thats 5.12 doesn't, then they might make a git-bisect more
reliable.  Having thousands of READs in the trace rather than just 245
should produce more reliable data.

As I suggested in the bug report I linked above, I am suspicious of TCP
offload when I see symptoms like this.  You can use "ethtool" to turn
off that various offload features.  Doing that and trying to reproduce
without offload might also provide useful information.

NeilBrown