On Nov 12, 2013, at 12:30 PM, Myklebust, Trond <Trond.Myklebust@xxxxxxxxxx> wrote: > On Tue, 2013-11-12 at 11:23 -0500, Chuck Lever wrote: >> On Nov 12, 2013, at 11:20 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote: >>> Ok, I think I see the problem. The looping comes from this block in >>> nfs4_discover_server_trunking: >>> >>> -----------------[snip]----------------- >>> case -NFS4ERR_CLID_INUSE: >>> case -NFS4ERR_WRONGSEC: >>> clnt = rpc_clone_client_set_auth(clnt, RPC_AUTH_UNIX); >>> if (IS_ERR(clnt)) { >>> status = PTR_ERR(clnt); >>> break; >>> } >>> /* Note: this is safe because we haven't yet marked the >>> * client as ready, so we are the only user of >>> * clp->cl_rpcclient >>> */ >>> clnt = xchg(&clp->cl_rpcclient, clnt); >>> rpc_shutdown_client(clnt); >>> clnt = clp->cl_rpcclient; >>> goto again; >>> -----------------[snip]----------------- >>> >>> ...so in the case of the reproducer, we get back -NFS4ERR_CLID_IN_USE, >>> at that point we call rpc_clone_client_set_auth(), which creates a new >>> rpc_clnt, but it's created as a child of the original. >>> >>> When rpc_shutdown_client is called, the original clnt is not destroyed >>> because the child still holds a reference to it. So, we go and try the >>> call again and it fails with the same error over and over again, and we >>> end up with a long chain of rpc_clnt's. >>> >>> How that ends up smashing the stack, I'm not sure though. I'm also not >>> sure of the remedy. It seems like we might ought to have some upper >>> bound on the number of SETCLIENTID attempts? >> >> CLID_INUSE is supposed to be a permanent error now. I think one retry, if any, is all that is appropriate. > > Right. If we hit CLID_INUSE in nfs4_discover_server_trunking then > > a) we know this is a server that we've already mounted > b) we know that either nfs4_init_client set us up with RPC_AUTH_UNIX to > begin with, or that rpc.gssd was started only after we'd already sent a > SETCLIENTID/EXCHANGE_ID using RPC_AUTH_UNIX to this server > > so the correct thing to do is to retry once if we know that we're not > already using AUTH_SYS, and then to EPERM. > > > Now that said, I agree that this should not be able to trigger a stack > overflow. Is this NFSv4 or NFSv4.1/NFSv4.2? Have either of you (Jeff and > Dros) tried enabling DEBUG_STACKOVERFLOW? IIRC it was a v4.0 mount when I hit this. Yes, I have CONFIG_DEBUG_STACKOVERFLOW=y. -dros > > -- > Trond Myklebust > Linux NFS client maintainer > > NetApp > Trond.Myklebust@xxxxxxxxxx > www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html