Re: Question on nfs40_discover_server_trunking.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 01/18/2013 02:43 PM, Chuck Lever wrote:

On Jan 18, 2013, at 5:34 PM, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:

On 01/18/2013 02:29 PM, Chuck Lever wrote:

On Jan 18, 2013, at 4:59 PM, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:

On 01/18/2013 01:33 PM, Chuck Lever wrote:

On Jan 18, 2013, at 4:28 PM, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:

Any chance the STALE_CLIENTID case needs a 'break'?

I don't think so.  LEASE_CONFIRM is set, and we want to wake the state renewal thread.


Twice I've seen kernel crashes after the nfs40_walk_client_list
failed (though code comments say it should never fail).

nfs40_walk_client_list() is looking for an nfs_client that is supposed to already be in the nfs_client list.  If the search fails, that's a bug.

Eyeball the contents of your nfs_client list.  You should find an appropriate nfs_client in there, and then figure out why the search doesn't find it.

Ok, I think I found another issue.

nfs4_client_init does not initialize 'old', but passes it to nfs4_discover_server_trunking.

That gets passed to the detect_trunking operation, which is nfs40_discover_server_trunking
or nfs41_discover_server_trunking.

Yes, *result is an output parameter, not an input parameter.

This will call walk_client_list, which also may not ever assign a value to
'result'.

It should always assign *result in the SUCCESS case.

The code in walk_client_list always dereferences result, however.

So, that is probably why my system blows up shortly after the 'impossible'
error message...

In particular, nfs4_walk_client_list() does not set *result when it returns NFS4ERR_STALE_CLIENTID.  In nfs4_discover_server_trunking(), should we therefore have this:

      status = nfs40_walk_client_list(clp, result, cred);
      switch (status) {
      case -NFS4ERR_STALE_CLIENTID:
              set_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);
           nfs4_schedule_state_renewal(clp);
           break;
      case 0:

I'm not sure the nfs4_schedule_state_renewal() call is necessary here.


I'm not sure I follow you entirely..but I'm going to start testing this patch in
a minute.  Looks like it should fix my crash, and maybe give clues about
why we get to the error case in the first place.


diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index d6b39a9..3188283 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -307,6 +307,8 @@ int nfs40_walk_client_list(struct nfs_client *new,
        };
        int status;

+       *result = NULL;
+
        spin_lock(&nn->nfs_client_lock);
        list_for_each_entry_safe(pos, n, &nn->nfs_client_list, cl_share_link) {
                /* If "pos" isn't marked ready, we can't trust the
@@ -364,6 +366,7 @@ int nfs40_walk_client_list(struct nfs_client *new,
                nfs_put_client(prev);
        spin_unlock(&nn->nfs_client_lock);
        pr_err("NFS: %s Error: no matching nfs_client found\n", __func__);
+       WARN_ON(1);
        return -NFS4ERR_STALE_CLIENTID;
}

@@ -433,6 +436,8 @@ int nfs41_walk_client_list(struct nfs_client *new,
        struct nfs_client *pos, *n, *prev = NULL;
        int error;

+       *result = NULL;
+

The design of this code is that output parameters should never be touched except in the "success" case.

        spin_lock(&nn->nfs_client_lock);
        list_for_each_entry_safe(pos, n, &nn->nfs_client_list, cl_share_link) {
                /* If "pos" isn't marked ready, we can't trust the
@@ -489,6 +494,7 @@ int nfs41_walk_client_list(struct nfs_client *new,
         */
        spin_unlock(&nn->nfs_client_lock);
        pr_err("NFS: %s Error: no matching nfs_client found\n", __func__);
+       WARN_ON(1);
        return -NFS4ERR_STALE_CLIENTID;
}
#endif /* CONFIG_NFS_V4_1 */
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index c351e6b..54d29e2 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -142,7 +142,8 @@ int nfs40_discover_server_trunking(struct nfs_client *clp,
        case 0:
                /* Sustain the lease, even if it's empty.  If the clientid4
                 * goes stale it's of no use for trunking discovery. */
-               nfs4_schedule_state_renewal(*result);
+               if (*result)
+                       nfs4_schedule_state_renewal(*result);

In the STALE_CLIENTID case, this would do state renewal on the wrong nfs_client.

                break;
        }

Let's look at this again:

  138         status = nfs40_walk_client_list(clp, result, cred);
  139         switch (status) {
  140         case -NFS4ERR_STALE_CLIENTID:
  141                 set_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);

This arm of the switch operates on "clp" not on "*result".  So we want to add:

        nfs4_schedule_state_renewal(clp);
        break;

That makes initializing *result above completely unnecessary.

Ok, I'll test that, but I still think the result pointer should be initialized
somewhere.  If a bug like this ever creeps in again, dereferencing a NULL
pointer should be a lot more obvious than dereferencing some random thing.



  142         case 0:
  143                 /* Sustain the lease, even if it's empty.  If the clientid4
  144                  * goes stale it's of no use for trunking discovery. */
  145                 nfs4_schedule_state_renewal(*result);
  146                 break;
  147         }



--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc  http://www.candelatech.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux