Re: User process NFS write hang in wait_on_commit with kworker

Alan Post <adp@xxxxxxxxx> · Fri, 28 Jun 2019 12:33:24 -0600

On Fri, Jun 21, 2019 at 02:47:23PM -0600, Alan Post wrote:
> > Verifying this is the problem could be done by setting up some rolling
> > network captures.. but sometimes it can be hard to not have the capture
> > fill up with continuing traffic from other processes.
> > 
> 
> I did go ahead and set up a rolling capture between this NFS
> server and one rack of clients--I hope I can catch the event as
> it happens.  Time will tell.
> 

I've run this rolling capture and did catch four candidate events.
I haven't confirmed any of them are real--I don't really know
what it is I'm looking for, so I've been approaching the problem
by incrementally/recursively throwing stuff out and manually
working through what's left.

As far as I understand it, for a particular xid, there should be a
call and a reply.  The approach I took then was to pull out these
fields from my capture and ignore RPC calls where both are present
in my capture.  It seems this is simplistic, as the number of RPC
calls I have without an attendant reply isn't lining up with my
incident window.
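
A minimal sketch of the matching described above, assuming the capture
has already been exported to per-message records. The RpcMsg layout
and the unanswered_calls helper are hypothetical names, not part of
any tool. Two things it tries to account for: an xid is only
meaningful per client, so the key includes both endpoints, and a
retransmitted call reuses its xid, so only the first transmission is
kept.

```python
from collections import namedtuple

# Hypothetical record layout; the fields would come from whatever
# tool exports the capture. src/dst are "address:port" strings.
RpcMsg = namedtuple("RpcMsg", "time src dst xid msg_type")

# msg_type values as defined in RFC 5531.
CALL, REPLY = 0, 1


def unanswered_calls(messages):
    """Return calls that never saw a reply, oldest first."""
    pending = {}
    for m in sorted(messages, key=lambda m: m.time):
        if m.msg_type == CALL:
            # Retransmissions reuse the xid; keep the first one seen.
            pending.setdefault((m.src, m.dst, m.xid), m)
        elif m.msg_type == REPLY:
            # A reply travels in the opposite direction, so swap
            # the endpoints to find the call it answers.
            pending.pop((m.dst, m.src, m.xid), None)
    return sorted(pending.values(), key=lambda m: m.time)


if __name__ == "__main__":
    sample = [
        RpcMsg(1.0, "c1:800", "s:2049", 7, CALL),
        RpcMsg(1.1, "s:2049", "c1:800", 7, REPLY),
        RpcMsg(1.2, "c2:801", "s:2049", 7, CALL),  # same xid, other client
        RpcMsg(1.3, "c1:800", "s:2049", 8, CALL),
        RpcMsg(2.3, "c1:800", "s:2049", 8, CALL),  # retransmission
    ]
    for m in unanswered_calls(sample):
        print(m.time, m.src, hex(m.xid))
```

Calls near the very end of a capture file will also show up as
unanswered if their reply landed in the next file of the rolling
capture, so the window boundaries need to be checked separately.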

In one example, I have a series of READ calls which cease
generating RPC reply messages as the offset for the file continues
to increase.  After a couple/few dozen messages, the RPC replies
continue as they were.  Is there a normal or routine explanation
for this?

RFC 5531 and the NetworkTracing page on wiki.linux-nfs.org have
been quite helpful bringing me up to speed.  If any of you have
advice or guidance, or can clarify my understanding of how the
call/reply RPC mechanism works, I would appreciate it.

-A
-- 
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: adp@xxxxxxxxx