Re: Question on nfs40_discover_server_trunking.

Ben Greear <greearb@xxxxxxxxxxxxxxx> · Fri, 18 Jan 2013 15:21:12 -0800

On 01/18/2013 03:14 PM, Chuck Lever wrote:

On Jan 18, 2013, at 5:59 PM, "Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote:

On Fri, 2013-01-18 at 17:03 -0500, an unknown sender wrote:
On Fri, 2013-01-18 at 16:33 -0500, Chuck Lever wrote:
On Jan 18, 2013, at 4:28 PM, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:

Any chance the STALE_CLIENTID case needs a 'break'?

I don't think so.  LEASE_CONFIRM is set, and we want to wake the state renewal thread.

Twice I've seen kernel crashes after the nfs40_walk_client_list
failed (though code comments say it should never fail).

nfs40_walk_client_list() is looking for an nfs_client that is supposed to already be in the nfs_client list.  If the search fails, that's a bug.

Eyeball the contents of your nfs_client list.  You should find an appropriate nfs_client in there, and then figure out why the search doesn't find it.

You have considered the fact that the call to
nfs4_proc_setclientid_confirm can potentially return
NFS4ERR_STALE_CLIENTID if the server rebooted while the client was
walking the list?

In fact, as far as I can see, the correct behaviour in
nfs40_discover_server_trunking() should be to re-issue the setclientid
call, and then walk the list again if nfs40_walk_client_list() returns
NFS4ERR_STALE_CLIENTID.

When I wrote the server trunking detection logic, I think we hadn't clearly decided what needed to be done in the STALE_CLIENTID case.

Something like the attached patch:

A couple of comments:

  o  nfs_get_client() already sticks the new client on the tail of the nfs_client list

  o  We don't want to get stuck in a loop here.  Should the "do {}" loop in nfs40_discover_server_trunking() be bounded by a retry count?

However, I haven't heard Ben say "oh, yes, my server had rebooted."  I'd like some confirmation that the match failed for an explainable and expected reason.

The server machine did not reboot, but it's badly overloaded,
trying to serve 3000 mount points that are
constantly being brought up and torn down while
NFS write traffic is going on.

Even with all this, I've seen this particular problem only twice in
around 2 days of solid testing (I've been optimizing my user-space
app, and the better it gets, the more kernel bugs I find!)

If you have some particular debug info you want printed in
the failure cause, I'll be happy to run with that.

Thanks,
Ben

--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc  http://www.candelatech.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html