Dear Bruce,

I wrote the script rpc-stat

root@bladerunner10 ~ # cat /opt/bin/rpc-stat
sed -n '/^device .* on \/ with/,/^$/ {/xprt:/ !d; p}' /proc/self/mountstats | cut -d " " -f 7,8,9,11,12 | \
( read TX RX BAD BQ MAXSLOTS && printf "running:%-3d timeout:%-3d, queued:%-3d, max:%-5d\n" $((TX-RX-BAD)) $BAD $BQ $MAXSLOTS )

to let it run on terminals via

watch -d -n .1 rpc-stat

This outputs something like

Every 0.1s: rpc-stat          bladerunner10: Tue Sep 25 08:43:02 2018

running:0   timeout:56146, queued:0  , max:38632

The value of "running" (TX-RX-BAD) is mostly zero; it seems to correspond well to the activity. I wonder about the "timeout" (bad XIDs) value, though. The recently booted bladerunner14 shows an unremarkable value, but the one on the struggling bladerunner10 seems much too high to me.

root@bladerunner14 ~ # rpc-stat
running:0   timeout:27 , queued:0  , max:10242
root@bladerunner14 ~ # uptime
 08:48:39 up 5 days, 21:25,  7 users,  load average: 10.96, 12.18, 12.47
                             ^--(6 users from the still running consoles with the busybox shells)

Here are the values of the other blade host used for the Production stage, bladerunner15:

root@bladerunner15 ~ # rpc-stat
running:1   timeout:18 , queued:0  , max:7440
root@bladerunner15 ~ # uptime
 08:53:05 up 111 days, 26 min,  1 user,  load average: 20.38, 19.27, 19.40

>-----Original Message-----
>From: 'J. Bruce Fields' [mailto:bfields@xxxxxxxxxxxx]
>Sent: Monday, September 24, 2018 11:59 PM
>To: Jäkel, Guido <G.Jaekel@xxxxxx>
>Cc: 'Jeff Layton' <jlayton@xxxxxxxxxx>; 'linux-nfs@xxxxxxxxxxxxxxx' <linux-nfs@xxxxxxxxxxxxxxx>
>Subject: Re: NFS3 subsystem hung, Kernel alive
>
>On Thu, Sep 20, 2018 at 10:52:17AM +0000, Jäkel, Guido wrote:
>> Hi all,
>>
>> Today at about "the event time" production kept running, but I discovered that one of the hosts in the Test stage (bladerunner10) had become very "stuttering" in reacting to commands.
>>
>> From https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsXprt I got some information about this.
>> And I started to
>>
>> watch -n 1 "sed -n '/^device .* on \/ with/,/^$/ p' /proc/self/mountstats"
>>
>> on the hosts to watch the root mount. On bladerunner10 I notice a very high value of the 8th field of xprt ('bad XIDs'), which is identical to the difference between filed 6 and 7 (TX-RX). Does that mean that there was a high number of bad answers to questions? Or is this the number of replies that are out of time?
>
>I don't know what you mean by "filed 6 and 7". Oh, wait, I guess you're
>talking about the 6th and 7th fields of the "xprt" line in mountstats.
>
>bad_xids means the client got a response but couldn't find a matching
>request. I'm not sure why that would happen--maybe a response came after
>the client gave up waiting for it?
>
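To cross-check the field positions discussed in this thread, here is a minimal sketch that labels the counters of the "xprt: tcp" line instead of picking them by position with cut. It assumes the TCP field order after "tcp" is port, bind count, connect count, connect time, idle time, sends, receives, bad XIDs, request-slot use, backlog use and, on kernels that report it, max slots. The function name decode_xprt is made up for illustration, and "running" is simply the TX-RX-BAD expression from rpc-stat, not an authoritative count of requests in flight.

```shell
#!/bin/sh
# Label the counters of the "xprt: tcp" line from a mountstats block.
# decode_xprt is a hypothetical helper name; the field order is an
# assumption stated in the text above.
decode_xprt() {
    awk '$1 == "xprt:" && $2 == "tcp" {
        # awk splits on tabs and spaces alike, so:
        # $3 port, $4 bind count, $5 connect count, $6 connect time,
        # $7 idle time, $8 sends, $9 receives, $10 bad XIDs,
        # $11 request-slot use, $12 backlog use,
        # $13 max slots (only on kernels that report it)
        printf "sends:%d recvs:%d bad_xids:%d running:%d bklog_u:%d max_slots:%s\n",
               $8, $9, $10, $8 - $9 - $10, $12, (NF >= 13 ? $13 : "n/a")
    }'
}
```

It can be fed from the same sed selection used for the root mount:

sed -n '/^device .* on \/ with/,/^$/ p' /proc/self/mountstats | decode_xprt

Matching on the "xprt:" and "tcp" tokens rather than on cut positions should keep the labels attached to the right counters even if the leading whitespace of the line differs.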