> On May 23, 2022, at 2:30 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Mon, 2022-05-23 at 18:21 +0000, Trond Myklebust wrote:
>> On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote:
>>> On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
>>>> On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
>>>>> On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
>>>>>> 
>>>>>>> On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
>>>>>>>> 
>>>>>>>>> On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
>>>>>>>>>> nfsd4_release_lockowner() holds clp->cl_lock when it calls
>>>>>>>>>> check_for_locks(). However, check_for_locks() calls nfsd_file_get()
>>>>>>>>>> / nfsd_file_put() to access the backing inode's flc_posix list, and
>>>>>>>>>> nfsd_file_put() can sleep if the inode was recently removed.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> It might be good to add a might_sleep() to nfsd_file_put?
>>>>>>>> 
>>>>>>>> I intend to include the patch you reviewed last week that
>>>>>>>> adds the might_sleep(), as part of this series.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>>> Let's instead rely on the stateowner's reference count to gate
>>>>>>>>>> whether the release is permitted. This should be a reliable
>>>>>>>>>> indication of locks-in-use since file lock operations and
>>>>>>>>>> ->lm_get_owner take appropriate references, which are released
>>>>>>>>>> appropriately when file locks are removed.
>>>>>>>>>> 
>>>>>>>>>> Reported-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
>>>>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>>>>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>>>>>>>> ---
>>>>>>>>>> fs/nfsd/nfs4state.c | 9 +++------
>>>>>>>>>> 1 file changed, 3 insertions(+), 6 deletions(-)
>>>>>>>>>> 
>>>>>>>>>> This might be a naive approach, but let's start with it.
>>>>>>>>>> 
>>>>>>>>>> This passes light testing, but it's not clear how much our existing
>>>>>>>>>> fleet of tests exercises this area. I've locally built a couple of
>>>>>>>>>> pynfs tests (one is based on the one Dai posted last week) and they
>>>>>>>>>> pass too.
>>>>>>>>>> 
>>>>>>>>>> I don't believe that FREE_STATEID needs the same simplification.
>>>>>>>>>> 
>>>>>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>>>>>>> index a280256cbb03..b77894e668a4 100644
>>>>>>>>>> --- a/fs/nfsd/nfs4state.c
>>>>>>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>>>>>>> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
>>>>>>>>>> 
>>>>>>>>>>  		/* see if there are still any locks associated with it */
>>>>>>>>>>  		lo = lockowner(sop);
>>>>>>>>>> -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
>>>>>>>>>> -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
>>>>>>>>>> -				status = nfserr_locks_held;
>>>>>>>>>> -				spin_unlock(&clp->cl_lock);
>>>>>>>>>> -				return status;
>>>>>>>>>> -			}
>>>>>>>>>> +		if (atomic_read(&sop->so_count) > 1) {
>>>>>>>>>> +			spin_unlock(&clp->cl_lock);
>>>>>>>>>> +			return nfserr_locks_held;
>>>>>>>>>>  		}
>>>>>>>>>> 
>>>>>>>>>>  		nfs4_get_stateowner(sop);
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> lm_get_owner is called from locks_copy_conflock, so if someone else
>>>>>>>>> happens to be doing a LOCKT or F_GETLK call at the same time that
>>>>>>>>> RELEASE_LOCKOWNER gets called, then this may end up returning an error
>>>>>>>>> inappropriately.
>>>>>>>> 
>>>>>>>> IMO releasing the lockowner while it's being used for _anything_
>>>>>>>> seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
>>>>>>>> the client is still using the lockowner for any reason, a
>>>>>>>> subsequent error will occur if the client tries to use it again.
>>>>>>>> Heck, I can see the server failing in mid-COMPOUND with this kind
>>>>>>>> of race. Better I think to just leave the lockowner in place if
>>>>>>>> there's any ambiguity.
>>>>>>>> 
>>>>>>> 
>>>>>>> The problem here is not the client itself calling RELEASE_LOCKOWNER
>>>>>>> while it's still in use, but rather a different client altogether
>>>>>>> calling LOCKT (or a local process does a F_GETLK) on an inode where a
>>>>>>> lock is held by a client. The LOCKT gets a reference to it (for the
>>>>>>> conflock), while the client that has the lockowner releases the lock
>>>>>>> and then the lockowner while the refcount is still high.
>>>>>>> 
>>>>>>> The race window for this is probably quite small, but I think it's
>>>>>>> theoretically possible. The point is that an elevated refcount on the
>>>>>>> lockowner doesn't necessarily mean that locks are actually being held
>>>>>>> by it.
>>>>>> 
>>>>>> Sure, I get that the lockowner's reference count is not 100%
>>>>>> reliable. The question is whether it's good enough.
>>>>>> 
>>>>>> We are looking for a mechanism that can simply count the number
>>>>>> of locks held by a lockowner. It sounds like you believe that
>>>>>> lm_get_owner / put_owner might not be a reliable way to do that.
>>>>>> 
>>>>>> 
>>>>>>>> The spec language does not say RELEASE_LOCKOWNER must not return
>>>>>>>> LOCKS_HELD for other reasons, and it does say that there is no
>>>>>>>> choice of using another NFSERR value (RFC 7530 Section 13.2).
>>>>>>>> 
>>>>>>> 
>>>>>>> What recourse does the client have if this happens? It released all of
>>>>>>> its locks and tried to release the lockowner, but the server says
>>>>>>> "locks held". Should it just give up at that point?
>>>>>>> RELEASE_LOCKOWNER is a sort
>>>>>>> of a courtesy by the client, I suppose...
>>>>>> 
>>>>>> RELEASE_LOCKOWNER is a courtesy for the server. Most clients
>>>>>> ignore the return code IIUC.
>>>>>> 
>>>>>> So the hazard caused by this race would be a small resource
>>>>>> leak on the server that would go away once the client's lease
>>>>>> was purged.
>>>>>> 
>>>>>> 
>>>>>>>>> My guess is that that would be pretty hard to hit the
>>>>>>>>> timing right, but not impossible.
>>>>>>>>> 
>>>>>>>>> What we may want to do is have the kernel do this check and only if
>>>>>>>>> it comes back >1 do the actual check for locks. That won't fix the
>>>>>>>>> original problem though.
>>>>>>>>> 
>>>>>>>>> In other places in nfsd, we've plumbed in a dispose_list head and
>>>>>>>>> deferred the sleeping functions until the spinlock can be dropped. I
>>>>>>>>> haven't looked closely at whether that's possible here, but it may
>>>>>>>>> be a more reliable approach.
>>>>>>>> 
>>>>>>>> That was proposed by Dai last week.
>>>>>>>> 
>>>>>>>> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@xxxxxxxxxx/T/#u
>>>>>>>> 
>>>>>>>> Trond pointed out that if two separate clients were releasing a
>>>>>>>> lockowner on the same inode, there is nothing that protects the
>>>>>>>> dispose_list, and it would get corrupted.
>>>>>>>> 
>>>>>>>> https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@xxxxxxxxxx/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
>>>>>>>> 
>>>>>>> 
>>>>>>> Yeah, that doesn't look like what's needed.
>>>>>>> 
>>>>>>> What I was going to suggest is a nfsd_file_put variant that takes a
>>>>>>> list_head. If the refcount goes to zero and the thing ends up being
>>>>>>> unhashed, then you put it on the dispose list rather than doing the
>>>>>>> blocking operations, and then clean it up later.
>>>>>> 
>>>>>> Trond doesn't like that approach; see the e-mail thread.
>>>>>> 
>>>>> 
>>>>> I didn't see him saying that that would be wrong, per-se, but the
>>>>> initial implementation was racy.
>>>>> 
>>>>> His suggestion was just to keep a counter in the lockowner of how many
>>>>> locks are associated with it. That seems like a good suggestion, though
>>>>> you'd probably need to add a parameter to lm_get_owner to indicate
>>>>> whether you were adding a new lock or just doing a conflock copy.
>>>> 
>>>> I don't think this should be necessary. The posix_lock code doesn't
>>>> ever use a struct file_lock that it hasn't allocated itself. We should
>>>> always be calling conflock to copy from whatever struct file_lock that
>>>> the caller passed as an argument.
>>>> 
>>>> IOW: the number of lm_get_owner and lm_put_owner calls should always be
>>>> 100% balanced once all the locks belonging to a specific lock owner are
>>>> removed.
>>>> 
>>> 
>>> We take references to the owner when we go to add a lock record, or when
>>> copying a conflicting lock. You want to keep a count of the former
>>> without counting the latter.
>>> 
>>> lm_get_owner gets called for both though. I don't see how you can
>>> disambiguate the two situations w/o some way to indicate that. Adding a
>>> bool argument to lm_get_owner/lm_put_owner ops would be pretty simple to
>>> implement, I think.
>>> 
>> 
>> Hmm...
>> That should be an extremely unlikely race, given that the
>> conflicting lock reference would have to be held for long enough to
>> cover the unlock + the release_lockowner / free_stateid RPC calls from
>> the client initially holding the lock, however I agree it is a
>> theoretical possibility.
>> 
> 
> If we want to live with the possibility of that race, then Chuck's
> original patch is fine, since the object refcount would always be
> equivalent to the lock count.
> 
> That said, I think it'd be better to keep an accurate count of lock
> records (sans conflocks) in the owner.

I don't have an objection to maintaining an accurate count of the locks
belonging to each lockowner. For now I feel comfortable that using
so_count is a good-enough solution, and it is an improvement over
holding a spinlock while trying to sleep. If we go with so_count, I can
add a comment that explains how so_count is used and the possible races.

OTOH, if we alter the lm_get/put_owner API, I would like to ensure the
new parameter will be difficult to misuse.

A few more random thoughts that might count towards due diligence:

- Is there a reliable way lm_get/put_owner can distinguish between the
  two cases by looking at their current set of arguments?

- Is it clear why the conflict case needs to take a reference on a
  lockowner that is not involved? Is there a way to avoid taking that
  reference in the first place?

A rough sketch of the kind of lm_get/put_owner change being discussed
is appended below my signature, for illustration only.

--
Chuck Lever
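
Illustration only (an untested sketch, not a proposal): one way the
lm_get_owner / lm_put_owner operations could distinguish a conflock
copy from a new lock record, so that nfsd could keep an accurate
per-lockowner count of lock records. The "conflock" parameter and the
"lo_lock_cnt" field are invented names for this sketch; they do not
exist in the kernel today.

/* include/linux/fs.h: let the copy helpers say whether the copy is
 * for a conflicting-lock report or for a new lock record. */
struct lock_manager_operations {
	fl_owner_t (*lm_get_owner)(fl_owner_t owner, bool conflock);
	void (*lm_put_owner)(fl_owner_t owner, bool conflock);
	/* ... */
};

/* fs/nfsd/nfs4state.c: take the usual stateowner reference, but
 * bump the (invented) lo_lock_cnt only for real lock records. */
static fl_owner_t
nfsd4_lm_get_owner(fl_owner_t owner, bool conflock)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	nfs4_get_stateowner(&lo->lo_owner);
	if (!conflock)
		atomic_inc(&lo->lo_lock_cnt);
	return owner;
}

static void
nfsd4_lm_put_owner(fl_owner_t owner, bool conflock)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (!lo)
		return;
	if (!conflock)
		atomic_dec(&lo->lo_lock_cnt);
	nfs4_put_stateowner(&lo->lo_owner);
}

nfsd4_release_lockowner() would then test lo_lock_cnt instead of
so_count, and the fs/locks.c copy helpers (locks_copy_conflock() when
reporting a conflict, locks_copy_lock() when inserting a new lock
record) would pass the flag accordingly. Whether that is worth the API
churn is exactly the question raised above.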