On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
> On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > 
> > > On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > 
> > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > > 
> > > > > On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > 
> > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it calls
> > > > > > check_for_locks(). However, check_for_locks() calls nfsd_file_get()
> > > > > > / nfsd_file_put() to access the backing inode's flc_posix list, and
> > > > > > nfsd_file_put() can sleep if the inode was recently removed.
> > > > > > 
> > > > > 
> > > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > > 
> > > > I intend to include the patch you reviewed last week that
> > > > adds the might_sleep(), as part of this series.
> > > > 
> > > > > > Let's instead rely on the stateowner's reference count to gate
> > > > > > whether the release is permitted. This should be a reliable
> > > > > > indication of locks-in-use since file lock operations and
> > > > > > ->lm_get_owner take appropriate references, which are released
> > > > > > appropriately when file locks are removed.
> > > > > > 
> > > > > > Reported-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
> > > > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > > > ---
> > > > > >  fs/nfsd/nfs4state.c |    9 +++------
> > > > > >  1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > > 
> > > > > > This might be a naive approach, but let's start with it.
> > > > > > 
> > > > > > This passes light testing, but it's not clear how much our existing
> > > > > > fleet of tests exercises this area. I've locally built a couple of
> > > > > > pynfs tests (one is based on the one Dai posted last week) and they
> > > > > > pass too.
> > > > > > 
> > > > > > I don't believe that FREE_STATEID needs the same simplification.
> > > > > > 
> > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > index a280256cbb03..b77894e668a4 100644
> > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
> > > > > > 
> > > > > >  		/* see if there are still any locks associated with it */
> > > > > >  		lo = lockowner(sop);
> > > > > > -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
> > > > > > -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
> > > > > > -				status = nfserr_locks_held;
> > > > > > -				spin_unlock(&clp->cl_lock);
> > > > > > -				return status;
> > > > > > -			}
> > > > > > +		if (atomic_read(&sop->so_count) > 1) {
> > > > > > +			spin_unlock(&clp->cl_lock);
> > > > > > +			return nfserr_locks_held;
> > > > > >  		}
> > > > > > 
> > > > > >  		nfs4_get_stateowner(sop);
> > > > > > 
> > > > > 
> > > > > lm_get_owner is called from locks_copy_conflock, so if someone else
> > > > > happens to be doing a LOCKT or F_GETLK call at the same time that
> > > > > RELEASE_LOCKOWNER gets called, then this may end up returning an error
> > > > > inappropriately.
> > > > 
> > > > IMO releasing the lockowner while it's being used for _anything_
> > > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
> > > > the client is still using the lockowner for any reason, a
> > > > subsequent error will occur if the client tries to use it again.
> > > > Heck, I can see the server failing in mid-COMPOUND with this kind
> > > > of race. Better I think to just leave the lockowner in place if
> > > > there's any ambiguity.
> > > 
> > > The problem here is not the client itself calling RELEASE_LOCKOWNER
> > > while it's still in use, but rather a different client altogether
> > > calling LOCKT (or a local process does a F_GETLK) on an inode where
> > > a lock is held by a client. The LOCKT gets a reference to it (for
> > > the conflock), while the client that has the lockowner releases the
> > > lock and then the lockowner while the refcount is still high.
> > > 
> > > The race window for this is probably quite small, but I think it's
> > > theoretically possible. The point is that an elevated refcount on
> > > the lockowner doesn't necessarily mean that locks are actually being
> > > held by it.
> > 
> > Sure, I get that the lockowner's reference count is not 100%
> > reliable. The question is whether it's good enough.
> > 
> > We are looking for a mechanism that can simply count the number
> > of locks held by a lockowner. It sounds like you believe that
> > lm_get_owner / put_owner might not be a reliable way to do that.
> > 
> > > > The spec language does not say RELEASE_LOCKOWNER must not return
> > > > LOCKS_HELD for other reasons, and it does say that there is no
> > > > choice of using another NFSERR value (RFC 7530 Section 13.2).
> > > 
> > > What recourse does the client have if this happens? It released all
> > > of its locks and tried to release the lockowner, but the server says
> > > "locks held". Should it just give up at that point?
> > > RELEASE_LOCKOWNER is a sort of a courtesy by the client, I suppose...
> > 
> > RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> > ignore the return code IIUC.
> > 
> > So the hazard caused by this race would be a small resource
> > leak on the server that would go away once the client's lease
> > was purged.
> > 
> > > > > My guess is that that would be pretty hard to hit the timing
> > > > > right, but not impossible.
> > > > > 
> > > > > What we may want to do is have the kernel do this check and only
> > > > > if it comes back >1 do the actual check for locks. That won't
> > > > > fix the original problem though.
> > > > > 
> > > > > In other places in nfsd, we've plumbed in a dispose_list head
> > > > > and deferred the sleeping functions until the spinlock can be
> > > > > dropped. I haven't looked closely at whether that's possible
> > > > > here, but it may be a more reliable approach.
> > > > 
> > > > That was proposed by Dai last week.
> > > > 
> > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@xxxxxxxxxx/T/#u
> > > > 
> > > > Trond pointed out that if two separate clients were releasing a
> > > > lockowner on the same inode, there is nothing that protects the
> > > > dispose_list, and it would get corrupted.
> > > > 
> > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@xxxxxxxxxx/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > 
> > > Yeah, that doesn't look like what's needed.
> > > 
> > > What I was going to suggest is a nfsd_file_put variant that takes a
> > > list_head. If the refcount goes to zero and the thing ends up being
> > > unhashed, then you put it on the dispose list rather than doing the
> > > blocking operations, and then clean it up later.
> > 
> > Trond doesn't like that approach; see the e-mail thread.
> 
> I didn't see him saying that that would be wrong, per-se, but the
> initial implementation was racy.
> 
> His suggestion was just to keep a counter in the lockowner of how many
> locks are associated with it. That seems like a good suggestion, though
> you'd probably need to add a parameter to lm_get_owner to indicate
> whether you were adding a new lock or just doing a conflock copy.

I don't think this should be necessary. The posix_lock code doesn't ever
use a struct file_lock that it hasn't allocated itself. We should always
be calling conflock to copy from whatever struct file_lock that the
caller passed as an argument.

IOW: the number of lm_get_owner and lm_put_owner calls should always be
100% balanced once all the locks belonging to a specific lock owner are
removed.

> 
> Checking the object refcount like this patch does seems wrong though.

Yes. This approach does require a separate counter that is only
bumped/decremented in the lock manager callbacks.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
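For illustration, a minimal sketch of the dedicated counter described
above: a count kept in nfsd's lockowner that only the
lm_get_owner/lm_put_owner callbacks touch, and that
nfsd4_release_lockowner() would test instead of the stateowner refcount.
The field name lo_lock_cnt and its exact placement are assumptions for
this sketch, not code from this thread.

struct nfs4_lockowner {
	struct nfs4_stateowner	lo_owner;
	/* ... existing fields ... */
	atomic_t		lo_lock_cnt;	/* hypothetical: one count per file_lock copy */
};

/* nfsd's lm_get_owner callback: called for every file_lock copy,
 * including conflock copies made for LOCKT / F_GETLK */
static fl_owner_t
nfsd4_fl_get_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	atomic_inc(&lo->lo_lock_cnt);		/* bumped only here ... */
	nfs4_get_stateowner(&lo->lo_owner);
	return owner;
}

/* nfsd's lm_put_owner callback: called when each such copy is released */
static void
nfsd4_fl_put_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (lo) {
		atomic_dec(&lo->lo_lock_cnt);	/* ... and dropped only here */
		nfs4_put_stateowner(&lo->lo_owner);
	}
}

/* in nfsd4_release_lockowner(), in place of the so_count test: */
	if (atomic_read(&lo->lo_lock_cnt) != 0) {
		spin_unlock(&clp->cl_lock);
		return nfserr_locks_held;
	}

Since nothing but the lock manager callbacks touches the counter,
stateid references and the RELEASE_LOCKOWNER lookup itself no longer
influence the check; a transient conflock copy can still bump it
briefly, which is the narrow race discussed above.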