Re: [PATCH v7 13/31] NFSv4.1: Ensure we always run TEST/FREE_STATEID on locks

"Benjamin Coddington" <bcodding@xxxxxxxxxx> · Thu, 10 Nov 2016 15:18:21 -0500

On 10 Nov 2016, at 10:58, Benjamin Coddington wrote:

Hi Anna,

On 10 Nov 2016, at 10:01, Anna Schumaker wrote:
Do you have an estimate for when this patch will be ready?  I want to 
include it in my next bugfix pull request for 4.9.

I haven't posted because I am still trying to get to the bottom of 
another
problem where the client gets stuck in a loop sending the same stateid 
over
and over on NFS4ERR_OLD_STATEID.  I want to make sure this problem 
isn't
caused by this fix -- which I don't think it is, but I'd rather make 
sure.
If I don't make any progress on this problem by the end of today, I'll 
post
what I have.

Read on if interested in this new problem:

It looks like racing opens with the same openowner can be returned out 
of
order by the server, so the client sees stateid seqid of 2 before 1.  
Then a
LOCK sent with seqid 1 is endlessly retried if sent while doing 
recovery.

It's hard to tell if I was able to capture all the moving parts to 
describe
this problem, though.  As it takes a very long time for me to 
reproduce, and
the packet captures were dropping frames.  I'm working on manually
reproducing it now.

Anna,

I haven't gotten to the bottom of it, and so I'm not confident it isn't 
a
problem created by the fix I've been testing, which is:

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index e809498..2aa9d86 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2564,12 +2564,15 @@ static void 
nfs41_check_delegation_stateid(struct
nfs4_state *state)
 static int nfs41_check_expired_locks(struct nfs4_state *state)
 {
        int status, ret = NFS_OK;
-       struct nfs4_lock_state *lsp;
+       struct nfs4_lock_state *lsp, *tmp;
        struct nfs_server *server = NFS_SERVER(state->inode);

        if (!test_bit(LK_STATE_IN_USE, &state->flags))
                goto out;
-       list_for_each_entry(lsp, &state->lock_states, ls_locks) {
+       spin_lock(&state->state_lock);
+       list_for_each_entry_safe(lsp, tmp, &state->lock_states, 
ls_locks) {
+               atomic_inc(&lsp->ls_count);
+               spin_unlock(&state->state_lock);
                if (test_bit(NFS_LOCK_INITIALIZED, &lsp->ls_flags)) {
                        struct rpc_cred *cred =
lsp->ls_state->owner->so_cred;

@@ -2588,7 +2591,10 @@ static int nfs41_check_expired_locks(struct
nfs4_state *state)
                                break;
                        }
                }
-       };
+               nfs4_put_lock_state(lsp);
+               spin_lock(&state->state_lock);
+       }
+       spin_unlock(&state->state_lock);
 out:
        return ret;
 }

http://people.redhat.com/bcodding/old_stateid_loop is tshark output of 
my
only good wirecapture of the problem.  Without this patch, generic/089
crashes long before this problem is reproduced, so I am stuck figuring 
it
out, I'm afraid.  Don't wait on my account.

I plan on trying a bit more to reproduce tomorrow, and if I cannot, I'll
write about it under separate cover.

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html