> On May 23, 2022, at 2:30 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Mon, 2022-05-23 at 18:21 +0000, Trond Myklebust wrote:
>> On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote:
>>> On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
>>>> On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
>>>>> On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
>>>>>> 
>>>>>>> On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
>>>>>>>> 
>>>>>>>>> On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
>>>>>>>>>> nfsd4_release_lockowner() holds clp->cl_lock when it calls
>>>>>>>>>> check_for_locks(). However, check_for_locks() calls nfsd_file_get()
>>>>>>>>>> / nfsd_file_put() to access the backing inode's flc_posix list, and
>>>>>>>>>> nfsd_file_put() can sleep if the inode was recently removed.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> It might be good to add a might_sleep() to nfsd_file_put?
>>>>>>>> 
>>>>>>>> I intend to include the patch you reviewed last week that
>>>>>>>> adds the might_sleep(), as part of this series.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>>> Let's instead rely on the stateowner's reference count to gate
>>>>>>>>>> whether the release is permitted. This should be a reliable
>>>>>>>>>> indication of locks-in-use since file lock operations and
>>>>>>>>>> ->lm_get_owner take appropriate references, which are released
>>>>>>>>>> appropriately when file locks are removed.
>>>>>>>>>> 
>>>>>>>>>> Reported-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
>>>>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>>>>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>>>>>>>> ---
>>>>>>>>>> fs/nfsd/nfs4state.c | 9 +++------
>>>>>>>>>> 1 file changed, 3 insertions(+), 6 deletions(-)
>>>>>>>>>> 
>>>>>>>>>> This might be a naive approach, but let's start with it.
>>>>>>>>>> 
>>>>>>>>>> This passes light testing, but it's not clear how much our existing
>>>>>>>>>> fleet of tests exercises this area. I've locally built a couple of
>>>>>>>>>> pynfs tests (one is based on the one Dai posted last week) and they
>>>>>>>>>> pass too.
>>>>>>>>>> 
>>>>>>>>>> I don't believe that FREE_STATEID needs the same simplification.
>>>>>>>>>> 
>>>>>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>>>>>>> index a280256cbb03..b77894e668a4 100644
>>>>>>>>>> --- a/fs/nfsd/nfs4state.c
>>>>>>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>>>>>>> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
>>>>>>>>>> 
>>>>>>>>>>  		/* see if there are still any locks associated with it */
>>>>>>>>>>  		lo = lockowner(sop);
>>>>>>>>>> -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
>>>>>>>>>> -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
>>>>>>>>>> -				status = nfserr_locks_held;
>>>>>>>>>> -				spin_unlock(&clp->cl_lock);
>>>>>>>>>> -				return status;
>>>>>>>>>> -			}
>>>>>>>>>> +		if (atomic_read(&sop->so_count) > 1) {
>>>>>>>>>> +			spin_unlock(&clp->cl_lock);
>>>>>>>>>> +			return nfserr_locks_held;
>>>>>>>>>>  		}
>>>>>>>>>> 
>>>>>>>>>>  		nfs4_get_stateowner(sop);
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> lm_get_owner is called from locks_copy_conflock, so if someone else
>>>>>>>>> happens to be doing a LOCKT or F_GETLK call at the same time that
>>>>>>>>> RELEASE_LOCKOWNER gets called, then this may end up returning an error
>>>>>>>>> inappropriately.
>>>>>>>> 
>>>>>>>> IMO releasing the lockowner while it's being used for _anything_
>>>>>>>> seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
>>>>>>>> the client is still using the lockowner for any reason, a
>>>>>>>> subsequent error will occur if the client tries to use it again.
>>>>>>>> Heck, I can see the server failing in mid-COMPOUND with this kind
>>>>>>>> of race. Better I think to just leave the lockowner in place if
>>>>>>>> there's any ambiguity.
>>>>>>>> 
>>>>>>> 
>>>>>>> The problem here is not the client itself calling RELEASE_LOCKOWNER
>>>>>>> while it's still in use, but rather a different client altogether
>>>>>>> calling LOCKT (or a local process does a F_GETLK) on an inode where a
>>>>>>> lock is held by a client. The LOCKT gets a reference to it (for the
>>>>>>> conflock), while the client that has the lockowner releases the lock
>>>>>>> and then the lockowner while the refcount is still high.
>>>>>>> 
>>>>>>> The race window for this is probably quite small, but I think it's
>>>>>>> theoretically possible. The point is that an elevated refcount on the
>>>>>>> lockowner doesn't necessarily mean that locks are actually being held
>>>>>>> by it.
>>>>>> 
>>>>>> Sure, I get that the lockowner's reference count is not 100%
>>>>>> reliable. The question is whether it's good enough.
>>>>>> 
>>>>>> We are looking for a mechanism that can simply count the number
>>>>>> of locks held by a lockowner. It sounds like you believe that
>>>>>> lm_get_owner / put_owner might not be a reliable way to do that.
>>>>>> 
>>>>>> 
>>>>>>>> The spec language does not say RELEASE_LOCKOWNER must not return
>>>>>>>> LOCKS_HELD for other reasons, and it does say that there is no
>>>>>>>> choice of using another NFSERR value (RFC 7530 Section 13.2).
>>>>>>>> 
>>>>>>> 
>>>>>>> What recourse does the client have if this happens? It released all of
>>>>>>> its locks and tried to release the lockowner, but the server says
>>>>>>> "locks held". Should it just give up at that point?
>>>>>>> RELEASE_LOCKOWNER is a sort
>>>>>>> of a courtesy by the client, I suppose...
>>>>>> 
>>>>>> RELEASE_LOCKOWNER is a courtesy for the server. Most clients
>>>>>> ignore the return code IIUC.
>>>>>> 
>>>>>> So the hazard caused by this race would be a small resource
>>>>>> leak on the server that would go away once the client's lease
>>>>>> was purged.
>>>>>> 
>>>>>> 
>>>>>>>>> My guess is that that would be pretty hard to hit the
>>>>>>>>> timing right, but not impossible.
>>>>>>>>> 
>>>>>>>>> What we may want to do is have the kernel do this check and only if
>>>>>>>>> it comes back >1 do the actual check for locks. That won't fix the
>>>>>>>>> original problem though.
>>>>>>>>> 
>>>>>>>>> In other places in nfsd, we've plumbed in a dispose_list head and
>>>>>>>>> deferred the sleeping functions until the spinlock can be dropped. I
>>>>>>>>> haven't looked closely at whether that's possible here, but it may
>>>>>>>>> be a more reliable approach.
>>>>>>>> 
>>>>>>>> That was proposed by Dai last week.
>>>>>>>> 
>>>>>>>> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@xxxxxxxxxx/T/#u
>>>>>>>> 
>>>>>>>> Trond pointed out that if two separate clients were releasing a
>>>>>>>> lockowner on the same inode, there is nothing that protects the
>>>>>>>> dispose_list, and it would get corrupted.
>>>>>>>> 
>>>>>>>> https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@xxxxxxxxxx/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
>>>>>>>> 
>>>>>>> 
>>>>>>> Yeah, that doesn't look like what's needed.
>>>>>>> 
>>>>>>> What I was going to suggest is a nfsd_file_put variant that takes a
>>>>>>> list_head. If the refcount goes to zero and the thing ends up being
>>>>>>> unhashed, then you put it on the dispose list rather than doing the
>>>>>>> blocking operations, and then clean it up later.
>>>>>> 
>>>>>> Trond doesn't like that approach; see the e-mail thread.
>>>>>> 
>>>>> 
>>>>> I didn't see him saying that that would be wrong, per-se, but the
>>>>> initial implementation was racy.
>>>>> 
>>>>> His suggestion was just to keep a counter in the lockowner of how many
>>>>> locks are associated with it. That seems like a good suggestion, though
>>>>> you'd probably need to add a parameter to lm_get_owner to indicate
>>>>> whether you were adding a new lock or just doing a conflock copy.
>>>> 
>>>> I don't think this should be necessary. The posix_lock code doesn't
>>>> ever use a struct file_lock that it hasn't allocated itself. We should
>>>> always be calling conflock to copy from whatever struct file_lock that
>>>> the caller passed as an argument.
>>>> 
>>>> IOW: the number of lm_get_owner and lm_put_owner calls should always be
>>>> 100% balanced once all the locks belonging to a specific lock owner are
>>>> removed.
>>>> 
>>> 
>>> We take references to the owner when we go to add a lock record, or when
>>> copying a conflicting lock. You want to keep a count of the former
>>> without counting the latter.
>>> 
>>> lm_get_owner gets called for both though. I don't see how you can
>>> disambiguate the two situations w/o some way to indicate that. Adding a
>>> bool argument to lm_get_owner/lm_put_owner ops would be pretty simple to
>>> implement, I think.
>>> 
>> 
>> Hmm...
>> That should be an extremely unlikely race, given that the
>> conflicting lock reference would have to be held for long enough to
>> cover the unlock + the release_lockowner / free_stateid RPC calls from
>> the client initially holding the lock, however I agree it is a
>> theoretical possibility.
>> 
> 
> If we want to live with the possibility of that race, then Chuck's
> original patch is fine, since the object refcount would always be
> equivalent to the lock count.
> 
> That said, I think it'd be better to keep an accurate count of lock
> records (sans conflocks) in the owner.

I don't have an objection to maintaining an accurate count of the locks
belonging to each lockowner. For now I feel comfortable that using
so_count is a good-enough solution, and it is an improvement over
holding a spinlock while trying to sleep. If we go with so_count, I can
add a comment that explains how so_count is used and the possible races.

OTOH, if we alter the lm_get/put_owner API, I would like to ensure the
new parameter will be difficult to misuse.

A few more random thoughts that might count towards due diligence:

- Is there a reliable way lm_get/put_owner can distinguish between the
  two cases by looking at their current set of arguments?

- Is it clear why the conflict case needs to take a reference on a
  lockowner that is not involved? Is there a way to avoid taking that
  reference in the first place?

A rough sketch of the kind of lm_get/put_owner change being discussed
is appended below my signature, for illustration only.

--
Chuck Lever
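
Illustration only (an untested sketch, not a proposal): one way the
lm_get_owner / lm_put_owner operations could distinguish a conflock
copy from a new lock record, so that nfsd could keep an accurate
per-lockowner count of lock records. The "conflock" parameter and the
"lo_lock_cnt" field are invented names for this sketch; they do not
exist in the kernel today.

/* include/linux/fs.h: let the copy helpers say whether the copy is
 * for a conflicting-lock report or for a new lock record. */
struct lock_manager_operations {
	fl_owner_t (*lm_get_owner)(fl_owner_t owner, bool conflock);
	void (*lm_put_owner)(fl_owner_t owner, bool conflock);
	/* ... */
};

/* fs/nfsd/nfs4state.c: take the usual stateowner reference, but
 * bump the (invented) lo_lock_cnt only for real lock records. */
static fl_owner_t
nfsd4_lm_get_owner(fl_owner_t owner, bool conflock)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	nfs4_get_stateowner(&lo->lo_owner);
	if (!conflock)
		atomic_inc(&lo->lo_lock_cnt);
	return owner;
}

static void
nfsd4_lm_put_owner(fl_owner_t owner, bool conflock)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (!lo)
		return;
	if (!conflock)
		atomic_dec(&lo->lo_lock_cnt);
	nfs4_put_stateowner(&lo->lo_owner);
}

nfsd4_release_lockowner() would then test lo_lock_cnt instead of
so_count, and the fs/locks.c copy helpers (locks_copy_conflock() when
reporting a conflict, locks_copy_lock() when inserting a new lock
record) would pass the flag accordingly. Whether that is worth the API
churn is exactly the question raised above.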