Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Tue, 25 Aug 2015 17:54:56 -0400

On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
> Hello Bruce,
> 
> On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> > > we have a weird problem with a Linux NFSv4.0 server (vanilla
> > > kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> > > 
> > > When mounting a share on the Solaris client and then rebooting
> > > the client without unmounting the share first, after the reboot
> > > every attempt to mount the share again gives an I/O error on
> > > the client and the mount fails.
> > > 
> > > After a long time (several hours) the v4 mount suddenly works
> > > again.
> > > 
> > > Mounting a share with vers=2 always works, even at times when
> > > the v4 mount fails.
> > > 
> > > So it seems the Linux NFSv4 server holds a state for the client
> > > which prevents the re-mounting of the share and gives the
> > > I/O-error on the client.
> > > 
> > > We use NFSv4 without idmapd.
> > > 
> > > Are there any tips on how to debug or solve this?
> > 
> > Best is probably to get a packet trace.  So something like:
> > 
> > 	tcpdump -s 0 -i em0 -w tmp.pcap
> > 
> > and then try the client mount, then kill the tcpdump after the mount
> > fails, and send us tmp.pcap.  (And/or take a look at tmp.pcap yourself
> > with wireshark.  The interesting question is what kind of error the
> > server is returning when the client tries the mount after reboot.)
> 
> Thank you for your reply. The tcpdump is attached, the relevant
> packets are 49..52. The error seems to be a SERVERFAULT. Can you
> see more from the dump?
> 
> Thanks again and best regards

The SERVERFAULT is on SETCLIENTID_CONFIRM.

In nfsd4_setclientid_confirm():

	conf = find_confirmed_client(clid, false, nn);
	unconf = find_unconfirmed_client(clid, false, nn);
	/*
	 * We try hard to give out unique clientid's, so if we get an
	 * attempt to confirm the same clientid with a different cred,
	 * there's a bug somewhere.  Let's charitably assume it's our
	 * bug.
	 */
	status = nfserr_serverfault;
	if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
		goto out;
	if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
		goto out;

The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
auth_unix creds.
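
To make the check above concrete, here is a rough userspace sketch of it. This is not the nfsd code: the toy_* names and the two-field cred struct are made up for illustration (the real same_creds() compares more than uid and gid), and the real function goes on to handle further cases where this sketch just returns success. The only value taken from the protocol is 10006, which is NFS4ERR_SERVERFAULT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 10006 is the protocol value of NFS4ERR_SERVERFAULT; the names are made up. */
enum { TOY_OK = 0, TOY_SERVERFAULT = 10006 };

/* Hypothetical, heavily simplified credential: only uid and gid. */
struct toy_cred {
	unsigned int uid;
	unsigned int gid;
};

/* Toy stand-in for same_creds(): field-by-field comparison. */
static bool toy_same_creds(const struct toy_cred *a, const struct toy_cred *b)
{
	assert(a != NULL && b != NULL);
	return a->uid == b->uid && a->gid == b->gid;
}

/*
 * Mirror of the quoted check: either record may be NULL (not found).
 * A record that exists but carries different creds yields SERVERFAULT.
 */
static int toy_confirm_check(const struct toy_cred *unconf,
			     const struct toy_cred *conf,
			     const struct toy_cred *rq)
{
	if (unconf && !toy_same_creds(unconf, rq))
		return TOY_SERVERFAULT;
	if (conf && !toy_same_creds(conf, rq))
		return TOY_SERVERFAULT;
	return TOY_OK;
}
```

The point of the sketch: when the unconfirmed record matches the request's creds, the only way left to hit SERVERFAULT here is a confirmed record under the same clientid whose creds differ.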

The clientid that we're looking up there was returned from the previous
SETCLIENTID, generated by this logic:

	if (conf && same_verf(&conf->cl_verifier, &clverifier))
		/* case 1: probable callback update */
		copy_clid(new, conf);
	else /* case 4 (new client) or cases 2, 3 (client reboot): */
		gen_clid(new, nn);

So it should be a brand new clientid, unless the client was reusing the old
verifier.
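
As a rough illustration of that selection logic, here is a small standalone model. Everything named toy_* is hypothetical, and the counter merely stands in for gen_clid(); the only protocol fact assumed is that an NFSv4.0 verifier is 8 opaque bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VERF_SIZE 8	/* NFSv4.0 verifiers are 8 opaque bytes */

/* Hypothetical, simplified stand-in for a confirmed client record. */
struct toy_client {
	uint64_t clientid;
	unsigned char verifier[VERF_SIZE];
};

static uint64_t toy_next_clientid = 1;	/* toy replacement for gen_clid() */

/* Toy equivalent of same_verf(): byte-wise comparison. */
static bool toy_same_verf(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, VERF_SIZE) == 0;
}

/*
 * Pick the clientid to return from SETCLIENTID.
 * conf is NULL when no confirmed record exists for this client.
 */
static uint64_t toy_setclientid(const struct toy_client *conf,
				const unsigned char *verf)
{
	assert(verf != NULL);
	if (conf && toy_same_verf(conf->verifier, verf))
		return conf->clientid;	/* case 1: old clientid is reused */
	return toy_next_clientid++;	/* new client or client reboot */
}
```

In this model a rebooted client gets a fresh clientid only if its verifier changed; with an unchanged verifier it is handed the clientid of the old confirmed record.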

So perhaps the client is sending the SETCLIENTID with a verifier set to
what it used on the previous boot?  That sounds like a client bug.  The
Linux client uses a timestamp for the verifier, and it looks like the
Solaris client might too.  Is there some reason the clock on this client
isn't advancing on reboot?
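
To show why a stuck clock would matter, here is a minimal sketch of a timestamp-derived verifier. The packing (seconds then nanoseconds, 4 bytes each) and the toy_boot_verifier() name are assumptions for illustration; I haven't checked what the Solaris client actually does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VERF_SIZE 8	/* NFSv4.0 verifiers are 8 opaque bytes */

/*
 * Hypothetical: build a verifier from the boot time by packing
 * seconds and nanoseconds into the 8 available bytes.
 */
static void toy_boot_verifier(uint32_t sec, uint32_t nsec,
			      unsigned char out[VERF_SIZE])
{
	assert(out != NULL);
	memcpy(out, &sec, sizeof(sec));
	memcpy(out + sizeof(sec), &nsec, sizeof(nsec));
}
```

If two boots see the same timestamp (say, the clock comes up at a fixed value before it is set), both boots produce identical bytes, and the server cannot tell the reboot apart from a callback update by the same client instance.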

--b.