Re: [PATCH RFC] NFSD: Fix possible sleep during nfsd4_release_lockowner()

On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> 
> > On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > 
> > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > 
> > > > On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > 
> > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > nfsd4_release_lockowner() holds clp->cl_lock when it calls
> > > > > check_for_locks(). However, check_for_locks() calls nfsd_file_get()
> > > > > / nfsd_file_put() to access the backing inode's flc_posix list, and
> > > > > nfsd_file_put() can sleep if the inode was recently removed.
> > > > > 
> > > > 
> > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > 
> > > I intend to include the patch you reviewed last week that
> > > adds the might_sleep(), as part of this series.
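> > > 
> > > The annotation itself is tiny; roughly this shape (a sketch, not
> > > the final patch):
> > > 
> > > 	void
> > > 	nfsd_file_put(struct nfsd_file *nf)
> > > 	{
> > > 		/* Complain if called from a context that cannot sleep */
> > > 		might_sleep();
> > > 
> > > 		/* ... existing put logic unchanged ... */
> > > 	}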
> > > 
> > > 
> > > > > Let's instead rely on the stateowner's reference count to gate
> > > > > whether the release is permitted. This should be a reliable
> > > > > indication of locks-in-use since file lock operations and
> > > > > ->lm_get_owner take appropriate references, which are released
> > > > > appropriately when file locks are removed.
> > > > > 
> > > > > Reported-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
> > > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > > ---
> > > > > fs/nfsd/nfs4state.c |    9 +++------
> > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > 
> > > > > This might be a naive approach, but let's start with it.
> > > > > 
> > > > > This passes light testing, but it's not clear how much our existing
> > > > > fleet of tests exercises this area. I've locally built a couple of
> > > > > pynfs tests (one is based on the one Dai posted last week) and they
> > > > > pass too.
> > > > > 
> > > > > I don't believe that FREE_STATEID needs the same simplification.
> > > > > 
> > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > index a280256cbb03..b77894e668a4 100644
> > > > > --- a/fs/nfsd/nfs4state.c
> > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
> > > > > 
> > > > > 		/* see if there are still any locks associated with it */
> > > > > 		lo = lockowner(sop);
> > > > > -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
> > > > > -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
> > > > > -				status = nfserr_locks_held;
> > > > > -				spin_unlock(&clp->cl_lock);
> > > > > -				return status;
> > > > > -			}
> > > > > +		if (atomic_read(&sop->so_count) > 1) {
> > > > > +			spin_unlock(&clp->cl_lock);
> > > > > +			return nfserr_locks_held;
> > > > > 		}
> > > > > 
> > > > > 		nfs4_get_stateowner(sop);
> > > > > 
> > > > > 
> > > > 
> > > > lm_get_owner is called from locks_copy_conflock, so if someone else
> > > > happens to be doing a LOCKT or F_GETLK call at the same time that
> > > > RELEASE_LOCKOWNER gets called, then this may end up returning an error
> > > > inappropriately.
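> > > > 
> > > > (The path I mean, roughly -- fs/locks.c copies the conflicting
> > > > lock back to the caller and takes an owner reference along the
> > > > way; a sketch from memory, not exact source:)
> > > > 
> > > > 	void locks_copy_conflock(struct file_lock *new,
> > > > 				 struct file_lock *fl)
> > > > 	{
> > > > 		/* ... copy fl's fields into new ... */
> > > > 
> > > > 		/* for nfsd this lands in nfsd4_fl_get_owner(),
> > > > 		 * which bumps the nfs4_lockowner's so_count */
> > > > 		if (fl->fl_lmops && fl->fl_lmops->lm_get_owner)
> > > > 			fl->fl_lmops->lm_get_owner(fl->fl_owner);
> > > > 	}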
> > > 
> > > IMO releasing the lockowner while it's being used for _anything_
> > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
> > > the client is still using the lockowner for any reason, a
> > > subsequent error will occur if the client tries to use it again.
> > > Heck, I can see the server failing in mid-COMPOUND with this kind
> > > of race. Better I think to just leave the lockowner in place if
> > > there's any ambiguity.
> > > 
> > 
> > The problem here is not the client itself calling RELEASE_LOCKOWNER
> > while it's still in use, but rather a different client altogether
> > calling LOCKT (or a local process doing an F_GETLK) on an inode where
> > a lock is held by a client. The LOCKT takes a reference to the
> > lockowner (for the conflock), while the client that owns the lockowner
> > releases the lock and then the lockowner itself while the refcount is
> > still elevated.
> > 
> > The race window for this is probably quite small, but I think it's
> > theoretically possible. The point is that an elevated refcount on the
> > lockowner doesn't necessarily mean that locks are actually being held by
> > it.
> 
> Sure, I get that the lockowner's reference count is not 100%
> reliable. The question is whether it's good enough.
> 
> We are looking for a mechanism that can simply count the number
> of locks held by a lockowner. It sounds like you believe that
> lm_get_owner / lm_put_owner might not be a reliable way to do
> that.
> 
> 
> > > The spec language does not say RELEASE_LOCKOWNER must not return
> > > LOCKS_HELD for other reasons, and it does say that there is no
> > > option to return any other NFS4ERR value (RFC 7530, Section 13.2).
> > > 
> > 
> > What recourse does the client have if this happens? It released all of
> > its locks and tried to release the lockowner, but the server says "locks
> > held". Should it just give up at that point? RELEASE_LOCKOWNER is a sort
> > of a courtesy by the client, I suppose...
> 
> RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> ignore the return code IIUC.
> 
> So the hazard caused by this race would be a small resource
> leak on the server that would go away once the client's lease
> was purged.
> 
> 
> > > > My guess is that it would be pretty hard to hit the
> > > > timing right, but not impossible.
> > > > 
> > > > What we may want to do is have the kernel do this refcount check and,
> > > > only if it comes back >1, do the actual check for locks. That won't fix
> > > > the original problem, though.
> > > > 
> > > > In other places in nfsd, we've plumbed in a dispose_list head and
> > > > deferred the sleeping functions until the spinlock can be dropped. I
> > > > haven't looked closely at whether that's possible here, but it may be a
> > > > more reliable approach.
> > > 
> > > That was proposed by Dai last week.
> > > 
> > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@xxxxxxxxxx/T/#u
> > > 
> > > Trond pointed out that if two separate clients were releasing a
> > > lockowner on the same inode, there is nothing that protects the
> > > dispose_list, and it would get corrupted.
> > > 
> > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@xxxxxxxxxx/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > 
> > 
> > Yeah, that doesn't look like what's needed.
> > 
> > What I was going to suggest is an nfsd_file_put variant that takes a
> > list_head. If the refcount goes to zero and the thing ends up being
> > unhashed, then you put it on the dispose list rather than doing the
> > blocking operations there, and clean it up later.
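> > 
> > Roughly this shape (the function name is made up, just to
> > illustrate; the real thing would also have to handle unhashing):
> > 
> > 	/* Like nfsd_file_put(), but defer anything that can block:
> > 	 * if this was the last reference, park the nfsd_file on the
> > 	 * caller's dispose list instead of cleaning up inline. */
> > 	void nfsd_file_put_async(struct nfsd_file *nf,
> > 				 struct list_head *dispose)
> > 	{
> > 		if (refcount_dec_and_test(&nf->nf_ref))
> > 			list_add(&nf->nf_lru, dispose);
> > 	}
> > 
> > The caller would then walk the dispose list and do the blocking
> > cleanup after dropping clp->cl_lock.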
> 
> Trond doesn't like that approach; see the e-mail thread.
> 

I didn't see him saying that that approach would be wrong per se, just
that the initial implementation was racy.

His suggestion was just to keep a counter in the lockowner of how many
locks are associated with it. That seems like a good suggestion, though
you'd probably need to add a parameter to lm_get_owner to indicate
whether you were adding a new lock or just doing a conflock copy.
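
Roughly like this (lo_lock_cnt and the "conflock" flag are made up
here, just to show the shape):

	/* hypothetical prototype change in lock_manager_operations:
	 * tell the fs whether this reference represents a new lock
	 * or just a conflock copy */
	void (*lm_get_owner)(fl_owner_t owner, bool conflock);

	static void nfsd4_fl_get_owner(fl_owner_t owner, bool conflock)
	{
		struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

		nfs4_get_stateowner(&lo->lo_owner);
		if (!conflock)
			atomic_inc(&lo->lo_lock_cnt);
	}

RELEASE_LOCKOWNER could then test lo_lock_cnt instead of so_count, so
a concurrent LOCKT or F_GETLK wouldn't cause a spurious
NFS4ERR_LOCKS_HELD.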

Checking the object refcount like this patch does seems wrong though.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>



