Re: User process NFS write hang in wait_on_commit with kworker

On Wed, Jun 19, 2019 at 08:38:02AM -0400, Benjamin Coddington wrote:
> TCP drops or overruns should not be a problem since the TCP layer will
> retransmit packets that are not acked.  The issue would be if the NFS
> server is perhaps silently dropping a response to an IO RPC.  Or, an
> intelligent middle-box that keeps its own stateful transparent TCP handling
> between client and server existed (you clearly don't have that here).
> 

My conclusion as well.  As part of debugging a multiplicity
of reliability issues with the cluster, we've found that
some workloads are more likely to lead to NFS client hangs.
We've migrated the exports used by those workloads to dedicated
NFS servers, one of which is the server under discussion here.


> So I recall some knfsd issues dropping replies in that era of kernel
> versions when the GSS sequencing grew out of a window.  Are you using a
> sec=krb5* on these mounts, or is it all sec=sys?  Perhaps that's the problem
> you are seeing.  Again, just some guessing.
> 

We're using sec=sys for the NFS clients that hung on
wait_on_commit, but have in the past used Kerberos.  I'm still
chasing down at least one intermittent, lingering issue where an
open(2) will return EIO, while on the wire those procedures
are returning NFS4ERR_EXPIRED.  What appears to be happening,
though I'm not certain yet, is that a RENEW CID is, or tries to
be, done with Kerberos when it was not previously, which succeeds,
but only in this degraded manner.

I cannot then rule out something of the sort you're describing.
Thank you for bringing it to my attention.


> Verifying this is the problem could be done by setting up some rolling
> network captures.. but sometimes it can be hard to not have the capture
> fill up with continuing traffic from other processes.
> 

I did go ahead and set up a rolling capture between this NFS
server and one rack of clients--I hope I can catch the event as
it happens.  Time will tell.
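
For anyone following along, a rolling capture of this sort can be
sketched roughly as below.  This is only an illustration, not the
exact setup used here: the interface name, client subnet, output
path, and file sizes are all placeholder assumptions.  It relies on
tcpdump's -C (rotate after N million bytes) and -W (cap the number
of files, forming a ring buffer) options, and filters on the NFS
port so unrelated traffic does not fill the ring.

```python
import shlex


def rolling_capture_cmd(interface, client_net, out_prefix,
                        file_mb=100, file_count=50):
    """Build a tcpdump ring-buffer command line for NFS traffic.

    All arguments are illustrative placeholders, not values from
    the setup discussed in this thread.
    """
    return [
        "tcpdump",
        "-i", interface,          # capture interface (assumed name)
        "-C", str(file_mb),       # rotate after this many million bytes
        "-W", str(file_count),    # keep at most this many files (ring)
        "-w", out_prefix,         # base name for the capture files
        # pcap filter: NFS port only, limited to one client subnet
        "port", "2049", "and", "net", client_net,
    ]


if __name__ == "__main__":
    # Print the command rather than running it; capture needs root.
    print(shlex.join(
        rolling_capture_cmd("eth0", "10.0.0.0/24", "/var/tmp/nfs.pcap")))
```

With the defaults above the ring holds roughly 5 GB, so the sizes
need tuning to how long it takes to notice a hang and stop the
capture before the interesting packets are overwritten.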

Regards,

-A
-- 
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: adp@xxxxxxxxx


