Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Wed, 26 Aug 2015 16:09:40 -0400

On Wed, Aug 26, 2015 at 09:54:22PM +0200, Ulrich Gemkow wrote:
> Hello Bruce,
> 
> On Tuesday 25 August 2015 23:54:56 J. Bruce Fields wrote:
> > The SERVERFAULT is on SETCLIENTID_CONFIRM.
> > 
> > In nfsd4_setclientid_confirm():
> > 
> > 	conf = find_confirmed_client(clid, false, nn);
> > 	unconf = find_unconfirmed_client(clid, false, nn);
> > 	/*
> > 	 * We try hard to give out unique clientid's, so if we get an
> > 	 * attempt to confirm the same clientid with a different cred,
> > 	 * there's a bug somewhere.  Let's charitably assume it's our
> > 	 * bug.
> > 	 */
> > 	status = nfserr_serverfault;
> > 	if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> > 		goto out;
> > 	if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> > 		goto out;
> > 
> > The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> > auth_unix creds.
> > 
> > The clientid that we're looking up there was returned from the previous
> > SETCLIENTID, generated by this logic:
> > 
> > 	if (conf && same_verf(&conf->cl_verifier, &clverifier))
> > 		/* case 1: probable callback update */
> > 		copy_clid(new, conf);
> > 	else /* case 4 (new client) or cases 2, 3 (client reboot): */
> > 		gen_clid(new, nn);
> > 
> > So it should be a brand new clientid, unless the client was reusing the old
> > verifier.
> > 
> > So perhaps the client is sending the SETCLIENTID with a verifier set to
> > what it used on the previous boot?  That sounds like a client bug.  The
> > Linux client uses a timestamp for the verifier; it looks like the Solaris
> > client might too.  Is there some reason the clock on this client isn't
> > advancing on reboot?
> 
> Thank you for the analysis. But the clock of the client advances
> regularly and as one would expect.

OK, thanks for checking that.
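
The verifier hypothesis quoted above can be sketched in isolation.  This
is a simplified userspace model, not the nfsd code: struct toy_client,
toy_setclientid() and the counter standing in for gen_clid() are all
hypothetical names made up for illustration.  It only shows why a client
that resends its pre-reboot verifier would be handed its old clientid
back instead of a fresh one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VERF_SIZE 8	/* NFSv4 verifiers are 8 opaque bytes */

/* Hypothetical, simplified stand-in for the server's per-client record. */
struct toy_client {
	bool confirmed;
	uint8_t verifier[VERF_SIZE];
	uint64_t clientid;
};

/* Hypothetical counter standing in for gen_clid(). */
static uint64_t toy_next_clientid = 1;

static bool toy_same_verf(const uint8_t *a, const uint8_t *b)
{
	return memcmp(a, b, VERF_SIZE) == 0;
}

/*
 * Pick the clientid to return from SETCLIENTID.  `conf` is the confirmed
 * record for this client's identifier string, or NULL if there is none.
 */
static uint64_t toy_setclientid(const struct toy_client *conf,
				const uint8_t *verf)
{
	assert(verf != NULL);
	if (conf && conf->confirmed && toy_same_verf(conf->verifier, verf))
		return conf->clientid;	/* case 1: old clientid handed back */
	return toy_next_clientid++;	/* new client or client reboot */
}
```

In this model a reboot only yields a new clientid if the verifier
changed, which is why the verifier the client sends after the reboot is
the interesting field.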

> The client is SPARC Solaris 10 with the latest patches
> applied - I cannot believe that this client has such a
> basic NFS bug.

To confirm or deny my hypothesis, I think what we want is a longer
capture that gets the failing SETCLIENTID_CONFIRM (as seen in the
previous capture) but also shows what clientid the client was using
before the reboot.  So the ideal might be something like:

	- start the capture
	- mount
	- create a file (I just want to make sure the client does at
	  least one open)
	- reboot the client
	- mount again, see the failure
	- stop the capture
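
Once such a capture exists, the comparison to make between the
SETCLIENTID exchange before the reboot and the one after it could be
written down roughly as follows.  This is only a sketch: struct
observed_setclientid, classify_capture() and the verdict names are
hypothetical, the values would be read off the capture by hand, and the
"suspect server" reading is an interpretation of the quoted logic rather
than something established in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VERF_SIZE 8	/* NFSv4 verifiers are 8 opaque bytes */

/* Hypothetical record of one SETCLIENTID exchange read off the capture. */
struct observed_setclientid {
	uint8_t verifier[VERF_SIZE];	/* verifier the client sent */
	uint64_t clientid;		/* clientid the server returned */
};

enum capture_verdict {
	VERIFIER_REUSED,	/* same verifier across the reboot: suspect client */
	CLIENTID_REUSED,	/* new verifier but old clientid: suspect server */
	NOTHING_REUSED		/* both changed: look elsewhere */
};

/* Compare the exchange seen before the reboot with the one seen after. */
static enum capture_verdict
classify_capture(const struct observed_setclientid *before,
		 const struct observed_setclientid *after)
{
	assert(before != NULL && after != NULL);
	if (memcmp(before->verifier, after->verifier, VERF_SIZE) == 0)
		return VERIFIER_REUSED;
	if (before->clientid == after->clientid)
		return CLIENTID_REUSED;
	return NOTHING_REUSED;
}
```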

> Can you think of any kind of server configuration bug
> (as said, Vanilla Linux 4.1.6) that causes this error?
> The NFS server startup system is "self-made"...

I can't think of any offhand; if there's a server-side problem here, I'd
suspect the code before the configuration.

--b.