On Tue, 05 Mar 2024, Chuck Lever wrote:
> On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote:
> > On Tue, 05 Mar 2024, Chuck Lever wrote:
> > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote:
> > > > move_to_close_lru() waits for sc_count to become zero while holding
> > > > rp_mutex.  This can deadlock if another thread holds a reference and is
> > > > waiting for rp_mutex.
> > > >
> > > > By the time we get to move_to_close_lru() the openowner is unhashed and
> > > > cannot be found any more.  So code waiting for the mutex can safely
> > > > retry the lookup if move_to_close_lru() has started.
> > > >
> > > > So change rp_mutex to an atomic_t with three states:
> > > >
> > > >   RP_UNLOCKED - state is still hashed, not locked for reply
> > > >   RP_LOCKED   - state is still hashed, is locked for reply
> > > >   RP_UNHASHED - state is not hashed, no code can get a lock.
> > > >
> > > > Use wait_var_event() to wait for either a lock, or for the owner to be
> > > > unhashed.  In the latter case, retry the lookup.
> > > >
> > > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > > ---
> > > >  fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++-------
> > > >  fs/nfsd/state.h     |  2 +-
> > > >  2 files changed, 32 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > index 690d0e697320..47e879d5d68a 100644
> > > > --- a/fs/nfsd/nfs4state.c
> > > > +++ b/fs/nfsd/nfs4state.c
> > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn)
> > > >  	atomic_set(&nn->nfsd_courtesy_clients, 0);
> > > >  }
> > > >
> > > > +enum rp_lock {
> > > > +	RP_UNLOCKED,
> > > > +	RP_LOCKED,
> > > > +	RP_UNHASHED,
> > > > +};
> > > > +
> > > >  static void init_nfs4_replay(struct nfs4_replay *rp)
> > > >  {
> > > >  	rp->rp_status = nfserr_serverfault;
> > > >  	rp->rp_buflen = 0;
> > > >  	rp->rp_buf = rp->rp_ibuf;
> > > > -	mutex_init(&rp->rp_mutex);
> > > > +	atomic_set(&rp->rp_locked, RP_UNLOCKED);
> > > >  }
> > > >
> > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
> > > > -		struct nfs4_stateowner *so)
> > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
> > > > +				       struct nfs4_stateowner *so)
> > > >  {
> > > >  	if (!nfsd4_has_session(cstate)) {
> > > > -		mutex_lock(&so->so_replay.rp_mutex);
> > > > +		wait_var_event(&so->so_replay.rp_locked,
> > > > +			       atomic_cmpxchg(&so->so_replay.rp_locked,
> > > > +					      RP_UNLOCKED, RP_LOCKED) != RP_LOCKED);
> > >
> > > What reliably prevents this from being a "wait forever" ?
> >
> > That same thing that reliably prevented the original mutex_lock from
> > waiting forever.
> > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when
> > we previously called mutex_unlock.  But it *also* aborts the wait if
> > rp_locked is set to RP_UNHASHED - so it is now more reliable.
> >
> > Does that answer the question?
>
> Hm. I guess then we are no worse off with wait_var_event().
>
> I'm not as familiar with this logic as perhaps I should be.
> How long does it take for the wake-up to occur, typically?

wait_var_event() is paired with wake_up_var().  The wake-up happens when
wake_up_var() is called, which in this code is always immediately after
atomic_set() updates the variable.

> > > > +		if (atomic_read(&so->so_replay.rp_locked) == RP_UNHASHED)
> > > > +			return -EAGAIN;
> > > >  		cstate->replay_owner = nfs4_get_stateowner(so);
> > > >  	}
> > > > +	return 0;
> > > >  }
> > > >
> > > >  void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate)
> > > > @@ -4453,7 +4464,8 @@ void nfsd4_cstate_clear_replay(struct nfsd4_compound_state *cstate)
> > > >
> > > >  	if (so != NULL) {
> > > >  		cstate->replay_owner = NULL;
> > > > -		mutex_unlock(&so->so_replay.rp_mutex);
> > > > +		atomic_set(&so->so_replay.rp_locked, RP_UNLOCKED);
> > > > +		wake_up_var(&so->so_replay.rp_locked);
> > > >  		nfs4_put_stateowner(so);
> > > >  	}
> > > >  }
> > > > @@ -4691,7 +4703,11 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
> > > >  	 * Wait for the refcount to drop to 2.  Since it has been unhashed,
> > > >  	 * there should be no danger of the refcount going back up again at
> > > >  	 * this point.
> > > > +	 * Some threads with a reference might be waiting for rp_locked,
> > > > +	 * so tell them to stop waiting.
> > > >  	 */
> > > > +	atomic_set(&oo->oo_owner.so_replay.rp_locked, RP_UNHASHED);
> > > > +	wake_up_var(&oo->oo_owner.so_replay.rp_locked);
> > > >  	wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2);
> > > >
> > > >  	release_all_access(s);
> > > > @@ -5064,11 +5080,15 @@ nfsd4_process_open1(struct nfsd4_compound_state *cstate,
> > > >  	clp = cstate->clp;
> > > >
> > > >  	strhashval = ownerstr_hashval(&open->op_owner);
> > > > +retry:
> > > >  	oo = find_or_alloc_open_stateowner(strhashval, open, cstate);
> > > >  	open->op_openowner = oo;
> > > >  	if (!oo)
> > > >  		return nfserr_jukebox;
> > > > -	nfsd4_cstate_assign_replay(cstate, &oo->oo_owner);
> > > > +	if (nfsd4_cstate_assign_replay(cstate, &oo->oo_owner) == -EAGAIN) {
> > > > +		nfs4_put_stateowner(&oo->oo_owner);
> > > > +		goto retry;
> > > > +	}
> > > >  	status = nfsd4_check_seqid(cstate, &oo->oo_owner, open->op_seqid);
> > > >  	if (status)
> > > >  		return status;
> > > > @@ -6828,11 +6848,15 @@ nfs4_preprocess_seqid_op(struct nfsd4_compound_state *cstate, u32 seqid,
> > > >  	trace_nfsd_preprocess(seqid, stateid);
> > > >
> > > >  	*stpp = NULL;
> > > > +retry:
> > > >  	status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn);
> > > >  	if (status)
> > > >  		return status;
> > > >  	stp = openlockstateid(s);
> > > > -	nfsd4_cstate_assign_replay(cstate, stp->st_stateowner);
> > > > +	if (nfsd4_cstate_assign_replay(cstate, stp->st_stateowner) == -EAGAIN) {
> > > > +		nfs4_put_stateowner(stp->st_stateowner);
> > > > +		goto retry;
> > > > +	}
> > > >
> > > >  	status = nfs4_seqid_op_checks(cstate, stateid, seqid, stp);
> > > >  	if (!status)
> > >
> > > My tool chain reports that this hunk won't apply to nfsd-next.
> >
> > You need better tools :-)
>
> <shrug> Essentially it is git, importing an mbox.  That kind of thing
> is generally the weakest aspect of all these tools.  Do you know of
> something more robust?

My tool of choice is "wiggle" - which I wrote.
I just put those 4 patches into an mbox and use "git am < mbox" to apply
to nfsd-next.  It hit a problem at this patch - 3/4.  So I ran

   wiggle -mrp .git/rebase-apply/patch

which worked without complaint (you might want --no-backup too).

But you probably want to know what the conflict was that "git am" found
and "wiggle" ignored.  So:

   git diff | diff -u -I '^@@' -I '^index' .git/rebase-apply/patch -

This shows

  +retry:
  -	status = nfsd4_lookup_stateid(cstate, stateid, typemask, &s, nn);
  +	status = nfsd4_lookup_stateid(cstate, stateid,
  +				      typemask, statusmask, &s, nn);

Once you have had a bit of practice reading diffs of diffs you can see
that the patch inserts "retry:" before a line which has changed between
the original and new bases.  The difference clearly isn't important to
this patch.  So "git add -u; git am --continue" will accept the wiggle
and continue with the "git am".

Patch 4/4 has one conflict:

  	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
  +	bool need_move_to_close_list;

  -	dprintk("NFSD: nfsd4_close on file %pd\n",
  +	dprintk("NFSD: nfsd4_close on file %pd\n",
  		cstate->current_fh.fh_dentry);

Again - not interesting (white space at end of line changed).

But maybe you would rather avoid that and have submitters base their
patches properly to start with :-)

> >
> > > In my copy of fs/nfsd/nfs4state.c, nfs4_preprocess_seqid_op() starts
> > > at line 7180, so something is whack-a-doodle.
> >
> > I have been working against master because the old version of the fix
> > was in nfsd-next.  I should have rebased when you removed it from
> > nfsd-next.  I have now, and will repost once you confirm you are
> > comfortable with the answer about waiting above.  Would adding a comment
> > there help?
>
> The mechanism is clear from the code, so a new comment, if you add
> one, should spell out what condition(s) mark rp_locked as UNLOCKED.
>
> But I might be missing something that should be obvious, in which
> case no comment is necessary.
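[Editorial aside: the diff-of-diffs trick can be reproduced in miniature.
The two patch files and their contents below are made up for illustration;
the point is the -I options, which tell diff to suppress hunks whose only
changed lines match '^@@' (hunk headers) or '^index' (blob ids) - exactly
the noise that differs between rebases of the same patch.]

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two versions of "the same" patch, rebased onto different bases:
# the hunk header and index line moved, and one callee grew a parameter.
cat > old.patch <<'EOF'
index 1111111..2222222 100644
@@ -10,2 +10,3 @@
 context line
+retry:
 status = lookup(a, b);
EOF

cat > new.patch <<'EOF'
index 3333333..4444444 100644
@@ -12,2 +12,3 @@
 context line
+retry:
 status = lookup(a, b, c);
EOF

# -U0 keeps each change in its own hunk so the ignorable ones can be
# dropped; '|| true' because diff exits 1 when differences remain.
diff -U0 -I '^@@' -I '^index' old.patch new.patch || true
```

Only the substantive difference (the lookup() call) survives; the index
and @@ changes are filtered out, which is what makes the real command
above readable.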
>

I've added a comment emphasising that it is much like a mutex, and
pointing to the matching unlocks.

Thanks,
NeilBrown