On Fri, 2013-01-18 at 18:14 -0500, Chuck Lever wrote: > On Jan 18, 2013, at 5:59 PM, "Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote: > > > On Fri, 2013-01-18 at 17:03 -0500, an unknown sender wrote: > >> On Fri, 2013-01-18 at 16:33 -0500, Chuck Lever wrote: > >>> On Jan 18, 2013, at 4:28 PM, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote: > >>> > >>>> Any chance the STALE_CLIENTID case needs a 'break'? > >>> > >>> I don't think so. LEASE_CONFIRM is set, and we want to wake the state renewal thread. > >>> > >>>> > >>>> Twice I've seen kernel crashes after the nfs40_walk_client_list > >>>> failed (though code comments say it should never fail). > >>> > >>> nfs40_walk_client_list() is looking for an nfs_client that is supposed to already be in the nfs_client list. If the search fails, that's a bug. > >>> > >>> Eyeball the contents of your nfs_client list. You should find an appropriate nfs_client in there, and then figure out why the search doesn't find it. > >> > >> You have considered the fact that the call to > >> nfs4_proc_setclientid_confirm can potentially return > >> NFS4ERR_STALE_CLIENTID if the server rebooted while the client was > >> walking the list? > > > > In fact, as far as I can see, the correct behaviour in > > nfs40_discover_server_trunking() should be to re-issue the setclientid > > call, and then walk the list again if nfs40_walk_client_list() returns > > NFS4ERR_STALE_CLIENTID. > > When I wrote the server trunking detection logic, I think we hadn't clearly decided what needed to be done in the STALE_CLIENTID case. > > > Something like the attached patch: > > A couple of comments: > > o nfs_get_client() already sticks the new client on the tail of the nfs_client list OK. That allows us to get rid of 5-6 lines of code. > o We don't want to get stuck in a loop here. Should the "do {}" loop in nfs40_discover_server_trunking() be bounded by a retry count? Not a number of retries, but possibly a timeout value. > However, I haven't heard Ben say "oh, yes, my server had rebooted." I'd like some confirmation that the match failed for an explainable and expected reason. The spec allows the server to discard unconfirmed clientids, so our code should take that into account. BTW: I'm going insane trying to figure out how the refcounting in that code is supposed to work. The assumption that we can somehow safely return a non-referenced pointer from nfs4_discover_server_trunking and then bump the reference count in nfs4_init_client() drives me nuts... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@xxxxxxxxxx www.netapp.com ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥