On Fri, 2013-01-18 at 19:44 -0500, Trond Myklebust wrote: > On Fri, 2013-01-18 at 18:14 -0500, Chuck Lever wrote: > > On Jan 18, 2013, at 5:59 PM, "Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote: > > > > > On Fri, 2013-01-18 at 17:03 -0500, an unknown sender wrote: > > >> On Fri, 2013-01-18 at 16:33 -0500, Chuck Lever wrote: > > >>> On Jan 18, 2013, at 4:28 PM, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote: > > >>> > > >>>> Any chance the STALE_CLIENTID case needs a 'break'? > > >>> > > >>> I don't think so. LEASE_CONFIRM is set, and we want to wake the state renewal thread. > > >>> > > >>>> > > >>>> Twice I've seen kernel crashes after the nfs40_walk_client_list > > >>>> failed (though code comments say it should never fail). > > >>> > > >>> nfs40_walk_client_list() is looking for an nfs_client that is supposed to already be in the nfs_client list. If the search fails, that's a bug. > > >>> > > >>> Eyeball the contents of your nfs_client list. You should find an appropriate nfs_client in there, and then figure out why the search doesn't find it. > > >> > > >> You have considered the fact that the call to > > >> nfs4_proc_setclientid_confirm can potentially return > > >> NFS4ERR_STALE_CLIENTID if the server rebooted while the client was > > >> walking the list? > > > > > > In fact, as far as I can see, the correct behaviour in > > > nfs40_discover_server_trunking() should be to re-issue the setclientid > > > call, and then walk the list again if nfs40_walk_client_list() returns > > > NFS4ERR_STALE_CLIENTID. > > > > When I wrote the server trunking detection logic, I think we hadn't clearly decided what needed to be done in the STALE_CLIENTID case. > > > > > Something like the attached patch: > > > > A couple of comments: > > > > o nfs_get_client() already sticks the new client on the tail of the nfs_client list > > OK. That allows us to get rid of 5-6 lines of code. > > > o We don't want to get stuck in a loop here. Should the "do {}" loop in nfs40_discover_server_trunking() be bounded by a retry count? > > Not a number of retries, but possibly a timeout value. > > > However, I haven't heard Ben say "oh, yes, my server had rebooted." I'd like some confirmation that the match failed for an explainable and expected reason. > > The spec allows the server to discard unconfirmed clientids, so our code > should take that into account. > > > BTW: I'm going insane trying to figure out how the refcounting in that > code is supposed to work. The assumption that we can somehow safely > return a non-referenced pointer from nfs4_discover_server_trunking and > then bump the reference count in nfs4_init_client() drives me nuts... Attached is v2 of the patch that addresses all of the above. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@xxxxxxxxxx www.netapp.com
From 5fdbe6129cc767b4d8926d4a53ce7ce075411156 Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> Date: Fri, 18 Jan 2013 17:54:34 -0500 Subject: [PATCH v2] NFSv4: Fix NFSv4.0 trunking discovery If walking the list in nfs40_walk_client_list fails, then the most likely explanation is that the server rebooted. Ensure that we catch that case by testing our new nfs_client last. Also fix reference counting issues that could lead to a memory leak for nfs_client structures. Reported-by: Ben Greear <greearb@xxxxxxxxxxxxxxx> Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> Cc: Chuck Lever <chuck.lever@xxxxxxxxxx> Cc: stable@xxxxxxxxxxxxxxx [>=3.7] --- fs/nfs/nfs4client.c | 37 +++++++++++++++---------------------- fs/nfs/nfs4state.c | 22 ++++++++++------------ 2 files changed, 25 insertions(+), 34 deletions(-) diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c index acc3472..bd6efe8 100644 --- a/fs/nfs/nfs4client.c +++ b/fs/nfs/nfs4client.c @@ -234,13 +234,12 @@ struct nfs_client *nfs4_init_client(struct nfs_client *clp, nfs_mark_client_ready(clp, NFS_CS_READY); error = nfs4_discover_server_trunking(clp, &old); + nfs_put_client(clp); if (error < 0) goto error; if (clp != old) { clp->cl_preserve_clid = true; - nfs_put_client(clp); clp = old; - atomic_inc(&clp->cl_count); } return clp; @@ -332,40 +331,33 @@ int nfs40_walk_client_list(struct nfs_client *new, if (prev) nfs_put_client(prev); + prev = pos; status = nfs4_proc_setclientid_confirm(pos, &clid, cred); - if (status == 0) { + switch (status) { + case -NFS4ERR_STALE_CLIENTID: + break; + case 0: nfs4_swap_callback_idents(pos, new); - nfs_put_client(pos); + prev = NULL; *result = pos; dprintk("NFS: <-- %s using nfs_client = %p ({%d})\n", __func__, pos, atomic_read(&pos->cl_count)); - return 0; - } - if (status != -NFS4ERR_STALE_CLIENTID) { - nfs_put_client(pos); - dprintk("NFS: <-- %s status = %d, no result\n", - __func__, status); - return status; + default: + goto out; } spin_lock(&nn->nfs_client_lock); - prev = pos; } + spin_unlock(&nn->nfs_client_lock); - /* - * No matching nfs_client found. This should be impossible, - * because the new nfs_client has already been added to - * nfs_client_list by nfs_get_client(). - * - * Don't BUG(), since the caller is holding a mutex. - */ + /* No match found. The server lost our clientid */ +out: if (prev) nfs_put_client(prev); - spin_unlock(&nn->nfs_client_lock); - pr_err("NFS: %s Error: no matching nfs_client found\n", __func__); - return -NFS4ERR_STALE_CLIENTID; + dprintk("NFS: <-- %s status = %d\n", __func__, status); + return status; } #ifdef CONFIG_NFS_V4_1 @@ -473,6 +465,7 @@ int nfs41_walk_client_list(struct nfs_client *new, if (!nfs4_match_serverowners(pos, new)) continue; + atomic_inc(&pos->cl_count); spin_unlock(&nn->nfs_client_lock); dprintk("NFS: <-- %s using nfs_client = %p ({%d})\n", __func__, pos, atomic_read(&pos->cl_count)); diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index 9448c57..102885d 100644 --- a/fs/nfs/nfs4state.c +++ b/fs/nfs/nfs4state.c @@ -129,24 +129,22 @@ int nfs40_discover_server_trunking(struct nfs_client *clp, if (clp->cl_addr.ss_family == AF_INET6) port = nn->nfs_callback_tcpport6; - status = nfs4_proc_setclientid(clp, NFS4_CALLBACK, port, cred, &clid); - if (status != 0) - goto out; - clp->cl_clientid = clid.clientid; - clp->cl_confirm = clid.confirm; + do { + status = nfs4_proc_setclientid(clp, NFS4_CALLBACK, port, cred, &clid); + if (status != 0) + break; + clp->cl_clientid = clid.clientid; + clp->cl_confirm = clid.confirm; - status = nfs40_walk_client_list(clp, result, cred); - switch (status) { - case -NFS4ERR_STALE_CLIENTID: - set_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state); - case 0: + status = nfs40_walk_client_list(clp, result, cred); + } while (status == -NFS4ERR_STALE_CLIENTID); + + if (status == 0) { /* Sustain the lease, even if it's empty. If the clientid4 * goes stale it's of no use for trunking discovery. */ nfs4_schedule_state_renewal(*result); - break; } -out: return status; } -- 1.8.1