Re: [PATCH RFC] NFSD: Fix possible sleep during nfsd4_release_lockowner()

> On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
>> 
>>> On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>> 
>>> On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
>>>> nfsd4_release_lockowner() holds clp->cl_lock when it calls
>>>> check_for_locks(). However, check_for_locks() calls nfsd_file_get()
>>>> / nfsd_file_put() to access the backing inode's flc_posix list, and
>>>> nfsd_file_put() can sleep if the inode was recently removed.
>>>> 
>>> 
>>> It might be good to add a might_sleep() to nfsd_file_put?
>> 
>> I intend to include the patch you reviewed last week that
>> adds the might_sleep(), as part of this series.
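
For reference, the annotation amounts to something like this (just a
sketch; the real nfsd_file_put() in fs/nfsd/filecache.c carries the
full put logic after the annotation):

	void nfsd_file_put(struct nfsd_file *nf)
	{
		might_sleep();	/* complain on debug kernels if called from atomic context */

		/* ... existing unhash/flush/release logic ... */
	}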
>> 
>> 
>>>> Let's instead rely on the stateowner's reference count to gate
>>>> whether the release is permitted. This should be a reliable
>>>> indication of locks-in-use since file lock operations and
>>>> ->lm_get_owner take appropriate references, which are released
>>>> appropriately when file locks are removed.
>>>> 
>>>> Reported-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>> ---
>>>> fs/nfsd/nfs4state.c |    9 +++------
>>>> 1 file changed, 3 insertions(+), 6 deletions(-)
>>>> 
>>>> This might be a naive approach, but let's start with it.
>>>> 
>>>> This passes light testing, but it's not clear how much our existing
>>>> fleet of tests exercises this area. I've locally built a couple of
>>>> pynfs tests (one is based on the one Dai posted last week) and they
>>>> pass too.
>>>> 
>>>> I don't believe that FREE_STATEID needs the same simplification.
>>>> 
>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>> index a280256cbb03..b77894e668a4 100644
>>>> --- a/fs/nfsd/nfs4state.c
>>>> +++ b/fs/nfsd/nfs4state.c
>>>> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
>>>> 
>>>> 		/* see if there are still any locks associated with it */
>>>> 		lo = lockowner(sop);
>>>> -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
>>>> -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
>>>> -				status = nfserr_locks_held;
>>>> -				spin_unlock(&clp->cl_lock);
>>>> -				return status;
>>>> -			}
>>>> +		if (atomic_read(&sop->so_count) > 1) {
>>>> +			spin_unlock(&clp->cl_lock);
>>>> +			return nfserr_locks_held;
>>>> 		}
>>>> 
>>>> 		nfs4_get_stateowner(sop);
>>>> 
>>>> 
>>> 
>>> lm_get_owner is called from locks_copy_conflock, so if someone else
>>> happens to be doing a LOCKT or F_GETLK call at the same time that
>>> RELEASE_LOCKOWNER gets called, then this may end up returning an error
>>> inappropriately.
>> 
>> IMO releasing the lockowner while it's being used for _anything_
>> seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
>> the client is still using the lockowner for any reason, a
>> subsequent error will occur if the client tries to use it again.
>> Heck, I can see the server failing in mid-COMPOUND with this kind
>> of race. Better I think to just leave the lockowner in place if
>> there's any ambiguity.
>> 
> 
> The problem here is not the client itself calling RELEASE_LOCKOWNER
> while the lockowner is still in use, but rather a different client
> altogether calling LOCKT (or a local process issuing an F_GETLK) on an
> inode where a lock is held by a client. The LOCKT takes a reference to
> the lockowner (for the conflock), while the client that owns the
> lockowner releases the lock and then the lockowner itself while the
> refcount is still elevated.
> 
> The race window for this is probably quite small, but I think it's
> theoretically possible. The point is that an elevated refcount on the
> lockowner doesn't necessarily mean that locks are actually being held by
> it.

Sure, I get that the lockowner's reference count is not 100%
reliable. The question is whether it's good enough.

We are looking for a mechanism that can simply count the number
of locks held by a lockowner. It sounds like you believe that
lm_get_owner / lm_put_owner might not be a reliable way to do
that.
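
To spell out the path you mean, it looks roughly like this
(simplified from fs/nfsd/nfs4state.c; exact names may vary by
kernel version):

	/* NFSD's ->lm_get_owner callback */
	static fl_owner_t
	nfsd4_fl_get_owner(fl_owner_t owner)
	{
		struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

		nfs4_get_stateowner(&lo->lo_owner);	/* bumps sop->so_count */
		return owner;
	}

Since locks_copy_conflock() invokes ->lm_get_owner, a concurrent
LOCKT or F_GETLK can hold so_count above 1 while no locks are
actually held.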


>> The spec language does not say RELEASE_LOCKOWNER must not return
>> LOCKS_HELD for other reasons, and it says there is no choice of
>> returning another NFS4ERR value (RFC 7530 Section 13.2).
>> 
> 
> What recourse does the client have if this happens? It released all of
> its locks and tried to release the lockowner, but the server says "locks
> held". Should it just give up at that point? RELEASE_LOCKOWNER is a sort
> of a courtesy by the client, I suppose...

RELEASE_LOCKOWNER is a courtesy for the server. Most clients
ignore the return code IIUC.

So the hazard caused by this race would be a small resource
leak on the server that would go away once the client's lease
was purged.


>>> My guess is that it would be pretty hard to hit the
>>> timing right, but not impossible.
>>> 
>>> What we may want to do is have the kernel do this refcount check
>>> first, and only if it comes back >1 do the actual check for locks.
>>> That won't fix the original problem, though.
>>> 
>>> In other places in nfsd, we've plumbed in a dispose_list head and
>>> deferred the sleeping functions until the spinlock can be dropped. I
>>> haven't looked closely at whether that's possible here, but it may be a
>>> more reliable approach.
>> 
>> That was proposed by Dai last week.
>> 
>> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@xxxxxxxxxx/T/#u
>> 
>> Trond pointed out that if two separate clients were releasing a
>> lockowner on the same inode, there is nothing that protects the
>> dispose_list, and it would get corrupted.
>> 
>> https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@xxxxxxxxxx/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
>> 
> 
> Yeah, that doesn't look like what's needed.
> 
> What I was going to suggest is an nfsd_file_put variant that takes a
> list_head. If the refcount goes to zero and the thing ends up being
> unhashed, then you put it on the dispose list rather than doing the
> blocking operations, and then clean it up later.

Trond doesn't like that approach; see the e-mail thread.
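
For the archive, the shape of what you're describing would be
roughly this (a hypothetical sketch, not merged code; nfsd_file_free()
is the sleeping teardown in fs/nfsd/filecache.c):

	/* put a reference while holding a spinlock; defer sleeping work */
	void nfsd_file_put_locked(struct nfsd_file *nf, struct list_head *dispose)
	{
		if (refcount_dec_and_test(&nf->nf_ref))
			list_add(&nf->nf_lru, dispose);
	}

	/* after the spinlock is dropped: */
	void nfsd_file_dispose_list(struct list_head *dispose)
	{
		struct nfsd_file *nf;

		while (!list_empty(dispose)) {
			nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
			list_del_init(&nf->nf_lru);
			nfsd_file_free(nf);	/* may sleep */
		}
	}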


> That said, nfsd_file_put has grown significantly in complexity over the
> years, so maybe that's not simple to do now.
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>

--
Chuck Lever