Hi

On 03/12/2018 11:45, Catalin Marinas wrote:
> Hi Trond,
>
> On Sun, Dec 02, 2018 at 04:44:49PM +0000, Trond Myklebust wrote:
>> On Fri, 2018-11-30 at 14:31 -0500, Trond Myklebust wrote:
>>> On Fri, 2018-11-30 at 16:19 +0000, Cristian Marussi wrote:
>>>> On 29/11/2018 19:56, Trond Myklebust wrote:
>>>>> On Thu, 2018-11-29 at 19:28 +0000, Cristian Marussi wrote:
>>>>> Question to you both: when this happens, does /proc/*/stack show
>>>>> any of the processes hanging in the socket or sunrpc code? If
>>>>> so, can you please send me examples of those stack traces (i.e.
>>>>> the contents of /proc/<pid>/stack for the processes that are
>>>>> hanging)
>>>>
>>>> (using a reverse shell since starting ssh causes a lot of pain and
>>>> traffic)
>>>>
>>>> Looking at NFS traffic holes(30-40 secs) to detect Client side
>>>> various HANGS
>>
>> Chuck and I have identified a few issues that might have an effect on
>> the hangs you report. Could you please give the linux-next branch in my
>> repository on git.linux-nfs.org (
>> https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/linux-next
>> ) a try?
>>
>> git pull git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
>
> I tried, unfortunately there's no difference for me (I merged the above
> branch on top of 4.20-rc5).
>

Same for me: the issue is still there. Besides that, I saw some differences
in the results of dbench, which I used for testing.

From the dbench output (compared with my previous mail), it seems that the
Unlink and Qpathinfo MaxLat values have normalized.

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX      90820    13.613 13855.620
 Close          66565    18.075 13853.289
 Rename          3845    23.668   326.642
 Unlink         18450     4.581   186.062
 Qpathinfo      82068     2.677   280.203
 Qfileinfo      14235    10.357   176.373
 Qfsinfo        15156     2.822   242.794
 Sfileinfo       7400    17.018   240.546
 Find           31812     5.988   277.332
 WriteX         44735     0.155    14.685
 ReadX         141872     0.741 13817.870
 LockX            288    10.558    96.179
 UnlockX          288     3.307    57.939
 Flush           6389    20.427   187.429

> Is there anything else blocked in the RPC layer? The above are all
> standard tasks waiting for the rpciod/xprtiod workqueues to complete
> the calls to the server.

cat /proc/692/stack
[<0>] __switch_to+0x6c/0x90
[<0>] rescuer_thread+0x2e8/0x360
[<0>] kthread+0x134/0x138
[<0>] ret_from_fork+0x10/0x1c
[<0>] 0xffffffffffffffff

I am now trying to collect more evidence by ftracing during the quiet,
stuck period until the restart happens.

Thanks

Cristian
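
P.S. In case it helps anyone else collect the /proc/<pid>/stack data requested earlier in the thread, here is a minimal sketch of how the scan could be automated. It is not something used for the results above. The function names and the PATTERNS list are assumptions made up for illustration, and the substrings are only a guess at what frames in the socket or sunrpc code might look like. Reading /proc/<pid>/stack normally needs root.

```python
import glob
import re

# Substrings hinting that a task sits in the RPC/NFS or socket layers.
# This list is an illustrative assumption, not an exhaustive set.
PATTERNS = ("rpc_", "xprt_", "nfs", "sock_", "tcp_")

# Matches lines such as "[<0>] rescuer_thread+0x2e8/0x360".
FRAME_RE = re.compile(r"^\[<[0-9a-fx]+>\]\s+(\S+)")


def parse_frames(stack_text):
    """Return symbol names (offset/size stripped) from /proc/<pid>/stack text."""
    frames = []
    for line in stack_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(1).split("+")[0])
    return frames


def interesting(frames, patterns=PATTERNS):
    """Keep only the frames whose symbol contains one of the patterns."""
    return [f for f in frames if any(p in f for p in patterns)]


def scan_proc():
    """Yield (pid, matching_frames) for every readable /proc/<pid>/stack."""
    for path in glob.glob("/proc/[0-9]*/stack"):
        pid = path.split("/")[2]
        try:
            with open(path) as handle:
                text = handle.read()
        except OSError:
            # The task exited in the meantime, or we lack permission.
            continue
        hits = interesting(parse_frames(text))
        if hits:
            yield pid, hits


if __name__ == "__main__":
    for pid, hits in scan_proc():
        print(pid, " <- ".join(hits))
```

Run as root during one of the quiet periods, this prints one line per task that has a matching frame. With the patterns above, the rescuer thread stack shown earlier would not be reported.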