RE: NFS3 subsystem hung, Kernel alive

Jäkel, Guido <G.Jaekel@xxxxxx> · Tue, 25 Sep 2018 06:56:22 +0000

Dear Bruce,

I wrote the script rpc-stat:

	root@bladerunner10 ~ # cat /opt/bin/rpc-stat 
	sed -n '/^device .* on \/ with/,/^$/ {/xprt:/ !d; p}'  /proc/self/mountstats | cut -d " " -f 7,8,9,11,12 | \
( read TX RX BAD BQ MAXSLOTS && printf "running:%-3d timeout:%-3d, queued:%-3d, max:%-5d\n"  $((TX-RX-BAD)) $BAD $BQ $MAXSLOTS )

and let it run in a terminal via

	watch -d -n .1 rpc-stat

This outputs something like

	Every 0.1s: rpc-stat              bladerunner10: Tue Sep 25 08:43:02 2018

	running:0   timeout:56146, queued:0  , max:38632

The value of "running" (TX-RX-BAD) is mostly zero and seems to correspond well to activity. I wonder about the "timeout" (bad XIDs) value, though. The recently booted  bladerunner14  shows an unobtrusive value, but the one on the struggling  bladerunner10  seems far too high to me.

	root@bladerunner14 ~ # rpc-stat 
	running:0   timeout:27 , queued:0  , max:10242

	root@bladerunner14 ~ # uptime
	 08:48:39 up 5 days, 21:25,  7 users,  load average: 10.96, 12.18, 12.47
		^--(6 users from the still running consoles with the busybox shells)

Here are the values of the other blade host used for the Production stage,  bladerunner15:

	root@bladerunner15 ~ # rpc-stat
	running:1   timeout:18 , queued:0  , max:7440

	root@bladerunner15 ~ # uptime
	 08:53:05 up 111 days, 26 min,  1 user,  load average: 20.38, 19.27, 19.40  
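
Since these counters appear to be cumulative for the lifetime of the mount, hosts with very different uptimes are hard to compare by the raw bad-XID count alone. A rough sketch that puts the bad XIDs in relation to the number of sends for the root mount could look like the following. It assumes a TCP "xprt:" line with the same field order my script relies on (sends, recvs, bad XIDs as the 6th to 8th values after "tcp"); the function name xprt_bad_ratio is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print sends, recvs, bad XIDs and the bad/sends share for the
# NFS mount on "/". Assumes a TCP xprt line laid out as:
#   xprt: tcp <port> <bind> <connect> <conn time> <idle> <sends> <recvs> <bad XIDs> ...
xprt_bad_ratio() {
    # $1: mountstats file to read (defaults to /proc/self/mountstats)
    awk '
        # "device X mounted on / with fstype ..." starts a section; field 5 is the mount point
        /^device / { inroot = ($5 == "/") }
        inroot && $1 == "xprt:" && $2 == "tcp" {
            sends = $8; recvs = $9; bad = $10
            pct = (sends + 0 > 0) ? 100 * bad / sends : 0
            printf "sends:%s recvs:%s bad:%s bad/sends:%.4f%%\n", sends, recvs, bad, pct
        }
    ' "${1:-/proc/self/mountstats}"
}
```

Calling xprt_bad_ratio without arguments reads the live counters; passing a saved copy of mountstats as the first argument allows comparing snapshots taken on different hosts.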

>-----Original Message-----
>From: 'J. Bruce Fields' [mailto:bfields@xxxxxxxxxxxx]
>Sent: Monday, September 24, 2018 11:59 PM
>To: Jäkel, Guido <G.Jaekel@xxxxxx>
>Cc: 'Jeff Layton' <jlayton@xxxxxxxxxx>; 'linux-nfs@xxxxxxxxxxxxxxx' <linux-nfs@xxxxxxxxxxxxxxx>
>Subject: Re: NFS3 subsystem hung, Kernel alive
>
>On Thu, Sep 20, 2018 at 10:52:17AM +0000, Jäkel, Guido wrote:
>> Hi all,
>>
>> Today, at about "the event time", production kept running, but I discovered that one of the hosts in the Test stage (bladerunner10) had become very "stuttering" in reacting to commands.
>>
>> I got some background on this from  https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsXprt  and started
>>
>> 	watch -n 1 "sed -n '/^device .* on \/ with/,/^$/ p'  /proc/self/mountstats"
>>
>> on the hosts to watch the root mount. On  bladerunner10  I noticed a very high value in the 8th field of xprt ('bad XIDs'), which is identical to the difference between fields 6 and 7 (TX-RX). Does that mean that there was a high number of bad answers to questions? Or is this the number of replies that arrived too late?
>
>I don't know what you mean by "fields 6 and 7".  Oh, wait, I guess you're
>talking about the 6th and 7th fields of the "xprt" line in mountstats.
>
>bad_xids means the client got a response but couldn't find a matching
>request.  I'm not sure why that would happen--maybe a response came after
>the client gave up waiting for it?
>
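
To make the hypothesis above concrete, here is a toy model in shell of how a late reply could end up counted as a bad XID. It is only an illustration under the assumption that the client keeps a list of outstanding XIDs and drops an entry when it gives up waiting; it is not how the kernel is implemented, and all function and variable names are invented for the example.

```shell
#!/bin/sh
# Toy model only, not kernel code: XIDs of requests still awaited are kept in
# a space-delimited list; a reply whose XID is not in the list is a bad XID.
pending=" "
bad_xids=0

# Record a request as outstanding.
send_request() { pending="$pending$1 "; }

# Drop an XID from the list, e.g. when its reply arrived or the caller gave up.
# Keeps the text before " XID " and the text after it, joined by one space.
forget_request() { pending="${pending%% $1 *} ${pending#* $1 }"; }

receive_reply() {
    case "$pending" in
        *" $1 "*) forget_request "$1" ;;         # matched an outstanding request
        *)        bad_xids=$((bad_xids + 1)) ;;  # nothing is waiting for this XID
    esac
}
```

In this model, calling send_request 101, then forget_request 101 (the client gave up), then receive_reply 101 leaves bad_xids at 1, while a reply that arrives in time is matched and does not touch the counter.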