Re: [PATCH 13/20] NFS: Fix recovery from NFS4ERR_CLID_INUSE

"Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> · Thu, 26 Apr 2012 19:57:19 +0000

On Thu, 2012-04-26 at 15:46 -0400, Chuck Lever wrote:
> On Apr 26, 2012, at 3:14 PM, Myklebust, Trond wrote:
> 
> > On Thu, 2012-04-26 at 15:04 -0400, Chuck Lever wrote:
> >> On Apr 26, 2012, at 2:53 PM, Myklebust, Trond wrote:
> >> 
> >>> On Thu, 2012-04-26 at 14:43 -0400, Chuck Lever wrote:
> >>>> On Apr 26, 2012, at 12:55 PM, Myklebust, Trond wrote:
> >>>> 
> >>>>> On Thu, 2012-04-26 at 12:24 -0400, Chuck Lever wrote:
> >>>>>> On Apr 23, 2012, at 4:55 PM, Chuck Lever wrote:
> >>>>> Then let's move the flavour out of the clientid string,
> >>>> 
> >>>> Removing the flavor from the nfs_client_id4 string makes sense.
> >>>> 
> >>>>> and just settle
> >>>>> for handling CLID_INUSE by changing the flavour on the SETCLIENTID call.
> >>>> 
> >>>> This is where I get hazy.  
> >>>> 
> >>>> If I simply change the authentication flavor on the existing clp->cl_rpcclient, will this affect ongoing RENEW operations that also use this transport?  Do we want subsequent RENEW operations to use the new flavor?
> >>>> 
> >>>> Thinking hypothetically, it seems to me that CLID_INUSE is really an indication of a permanent configuration error, or a software bug, and we should not bother to recover.  But maybe that's my limited imagination.  Under what use cases do you think CLID_INUSE might occur and it might be useful to attempt recovery?
> >>>> 
> >>> 
> >>> The server caches the principal name that was used to call SETCLIENTID
> >>> when the lease was established. Any attempt to call SETCLIENTID with a
> >>> different principal will result in CLID_INUSE unless the lease has
> >>> expired.
> >>> 
> >>> So what I was proposing wasn't that you try to change the authentication
> >>> flavour on an existing nfs_client. It was that when you are probing, you
> >>> can use the CLID_INUSE reply from SETCLIENTID as a direct indication
> >>> that the server is indeed trunked, and that you already hold a lease on
> >>> that server, but that the authentication flavour that you are trying to
> >>> use is wrong.
> >> 
> >> The use case would be that my client has mounted a server via address X using authentication flavor 1, and then tries to mount the same server via address Y using authentication flavor 2.
> > 
> > ...for which the result should be that all setclientid/confirm and renew
> > requests will use flavour 1.
> 
> Agreed.
> 
> >> Do we even need to retry the SETCLIENTID and to perform a SETCLIENTID_CONFIRM in that case?
> > 
> > Yes. Otherwise we end up with 2 leases on the same server.
> 
> I don't see how...  If the second SETCLIENTID fails with CLID_INUSE then the server still has the first lease that's using flavor 1.  "Boom, done."

Sorry. I thought you were implying that we should use a different
clientid or something like that.

You still do need to issue the SETCLIENTID in order to figure out the
trunking topology so that you can map address Y to address X.
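
To make the probing idea concrete, here is a minimal sketch of the behaviour discussed above. It is a toy user-space model, not kernel code: `struct lease`, `model_setclientid()` and `probe_trunked()` are hypothetical names invented for illustration, and the "server" is reduced to the one rule described in this thread (a SETCLIENTID with a different principal fails with CLID_INUSE unless the lease has expired). The numeric value of NFS4ERR_CLID_INUSE is the one from RFC 3530.

```c
#include <stddef.h>
#include <string.h>

/* Status codes; 10017 is NFS4ERR_CLID_INUSE in RFC 3530. */
enum { NFS4_OK = 0, NFS4ERR_CLID_INUSE = 10017 };

/* Hypothetical model of the lease state a server caches per clientid. */
struct lease {
	const char *clientid;   /* nfs_client_id4 string, no flavour embedded */
	const char *principal;  /* principal that established the lease */
	int expired;            /* non-zero once the lease has expired */
};

/*
 * Toy SETCLIENTID: reject a different principal while an unexpired
 * lease exists for the same clientid string; otherwise (re)establish it.
 */
static int model_setclientid(struct lease *srv, const char *clientid,
			     const char *principal)
{
	if (srv->clientid != NULL &&
	    strcmp(srv->clientid, clientid) == 0 &&
	    !srv->expired &&
	    strcmp(srv->principal, principal) != 0)
		return NFS4ERR_CLID_INUSE;

	srv->clientid = clientid;
	srv->principal = principal;
	srv->expired = 0;
	return NFS4_OK;
}

/*
 * Hypothetical probe of a second address (Y). If the requested principal
 * draws CLID_INUSE and we already hold a lease via address X, treat that
 * as evidence of trunking and retry with the existing principal.
 * Returns the principal to use for lease traffic, or NULL on failure.
 */
static const char *probe_trunked(struct lease *srv, const char *clientid,
				 const char *requested, const char *existing)
{
	int status = model_setclientid(srv, clientid, requested);

	if (status == NFS4_OK)
		return requested;
	if (status == NFS4ERR_CLID_INUSE && existing != NULL &&
	    model_setclientid(srv, clientid, existing) == NFS4_OK)
		return existing;  /* same server: map Y onto X's lease */
	return NULL;
}
```

In this model, a mount via address Y that asks for flavour 2 ends up using flavour 1 for setclientid/confirm and renew, and no second lease is created. How the real client would pick the "existing" principal and match address Y to address X is left open here.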

> >> Now, what about nfs4_reclaim_lease() ?  If the client sees CLID_INUSE during a lease reclaim, no trunking discovery is involved.
> > 
> > That would mean that the lease was expired, and that someone sent a
> > SETCLIENTID call to the server using our clientid string, but using the
> > wrong principal. There are 2 cases:
> > 
> > 1) Someone is spoofing our client. I've no idea how to recover from
> > this, short of changing the clientid string.
> 
> Maybe we should keep cl_uniquifier for case 1...?  Since nfs4_reclaim_lease() is called in the state manager, it has to do something to recover or make the waiting process error out.

Yes, but that breaks the UCS trunking-detection model. Spoofing is bad
no matter what happens, and papering around it with cl_uniquifier was
wrong. What if you just happened to use the correct principal (very easy
if you are using AUTH_SYS) and didn't detect that the clientid is being
spoofed?

> > 2) The server is trunked, the lease expired, and we happened to call
> > 'mount' while it was expired, and inadvertently sent a SETCLIENTID
> > +SETCLIENTID_CONFIRM call to the server using a different IP address,
> > and using the wrong principal.
> 
> The clid_init_mutex should exclude this case...?

I assume so...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com
