On Fri, 2016-07-29 at 12:43 -0400, Nick Bowler wrote:
> Hi guys,
>
> On 2015-10-13, Nick Bowler <nbowler@xxxxxxxxxx> wrote:
> > On 2015-10-13, Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> > > On Mon, 12 Oct 2015 23:01:36 -0400
> > > Nick Bowler <nbowler@xxxxxxxxxx> wrote:
> > > > On 2015-10-12 15:46 -0400, J. Bruce Fields wrote:
> > > > > On Mon, Oct 12, 2015 at 03:25:38PM -0400, bfields wrote:
> > > > > > On Mon, Oct 12, 2015 at 12:48:56PM -0400, Nick Bowler wrote:
> [...]
> > > > > > > the failing syscall seems to be:
> > > > > > >
> > > > > > >   fcntl(7, F_SETLK, {type=F_RDLCK, whence=SEEK_SET,
> > > > > > >   start=1073741824, len=1}) = -1 EIO (Input/output error)
> > > > > > >
> > > > > > > When the issue occurs, the client dmesg log is full of messages of
> > > > > > > the form:
> > > > > > >
> > > > > > >   [3441972.381211] NFS: v4 server returned a bad sequence-id error
> > > > > > >   on an unconfirmed sequence ffff88007612ae20!
> > > > > > >
> > > > > > > There are no unusual messages on the server.
> > [...]
> > > Ok, makes sense. The log shows that it occurred in a fcntl call, so
> > > it's probably this from lookup_or_create_lock_state:
> > >
> > >     lo = find_lockowner_str(cl, &lock->lk_new_owner);
> > >     if (!lo) {
> > >         strhashval = ownerstr_hashval(&lock->lk_new_owner);
> > >         lo = alloc_init_lock_stateowner(strhashval, cl, ost,
> > >                                         lock);
> > >         if (lo == NULL)
> > >             return nfserr_jukebox;
> > >     } else {
> > >         /* with an existing lockowner, seqids must be the same */
> > >         status = nfserr_bad_seqid;
> > >         if (!cstate->minorversion &&
> > >             lock->lk_new_lock_seqid != lo->lo_owner.so_seqid)
> > >             goto out;
> > >     }
> > >
> > > ...so we found an existing lockowner, but the seqid in the call is
> > > wrong. It seems like the client ought to try to recover in this case,
> > > but I don't see where it handles BAD_SEQID errors in the locking code.
> [...]
>
> > > In any case, the question now is whether this is a client or server
> > > bug. What would tell us that is a network capture of the NFS traffic
> > > between client and server at the time that this occurs. Would it be
> > > possible to collect one? If so, then let Bruce and I know and we can
> > > figure out a way to share it privately.
>
> Hi guys,
>
> Unfortunately I did not manage to perform a network capture last time
> due to power loss. I did not hit this issue again until yesterday (~9
> months later), this time after 45 days of uptime.
>
> Kernel versions now are: 4.5.1 on the server, and 4.4.3 on the client.
>
> Since it's now in a failing state again (this situation persists until
> a reboot of the client), I captured with strace and tcpdump (on both
> client and server) when attempting to start gmpc, the result is quite
> small (just 30 packets). Will that be helpful?
>
> Thanks,
>   Nick

I doubt we'd be able to tell much after the fact, but feel free to send
it along.

Thanks,
--
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html