> On Jan 28, 2023, at 10:20 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Sat, 2023-01-28 at 14:15 +0000, Chuck Lever III wrote:
>> [ Cc'ing the original author of this code. ]
>>
>> Proposed patch is here:
>>
>> https://lore.kernel.org/linux-nfs/979eebe94ef380af6a5fdb831e78fd4c0946a59e.1674836262.git.bcodding@xxxxxxxxxx/
>>
>>> On Jan 28, 2023, at 8:47 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>
>>> On Sat, 2023-01-28 at 08:31 -0500, Benjamin Coddington wrote:
>>>> On 27 Jan 2023, at 13:03, Jeff Layton wrote:
>>>>
>>>>> On Fri, 2023-01-27 at 11:42 -0500, Benjamin Coddington wrote:
>>>>>> On 27 Jan 2023, at 11:34, Chuck Lever III wrote:
>>>>>>
>>>>>>>> On Jan 27, 2023, at 11:18 AM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Its possible for __break_lease to find the layout's lease before we've
>>>>>>>> added the layout to the owner's ls_layouts list. In that case, setting
>>>>>>>> ls_recalled = true without actually recalling the layout will cause the
>>>>>>>> server to never send a recall callback.
>>>>>>>>
>>>>>>>> Move the check for ls_layouts before setting ls_recalled.
>>>>>>>>
>>>>>>>> Signed-off-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
>>>>>>>
>>>>>>> Did this start misbehaving recently, or has it always been broken?
>>>>>>> That is, does it need:
>>>>>>>
>>>>>>> Fixes: c5c707f96fc9 ("nfsd: implement pNFS layout recalls") ?
>>>>>>
>>>>>> I'm doing some new testing of racing LAYOUTGET and CB_LAYOUTRETURN after
>>>>>> running into a livelock, so I think it has always been broken and the Fixes
>>>>>> tag is probably appropriate.
>>>>>>
>>>>>> However, now I'm wondering if we'd run into trouble if ls_layouts could be
>>>>>> empty but the lease still exist.. but that seems like it would be a
>>>>>> different bug.
>>>>>>
>>>>>
>>>>> Yeah, is that even possible? Surely once the last layout is gone, we
>>>>> drop the stateid? In any case, this patch looks fine. You can add:
>>>>>
>>>>> Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>>
>>>> Jeff pointed out that there's another problem here. We can't just skip
>>>> sending the callback if ls_layouts is empty, because then the process trying
>>>> to break the lease will end up spinning in __break_lease.
>>>>
>>>> I think we can drop the list_empty() check altogether - it must be there so
>>>> that we don't race in and send a callback for a layout that's already been
>>>> returned, but I don't see any harm in that. Clients should just return
>>>> NO_MATCHING_LAYOUT.
>>>>
>>>
>>> The bigger worry (AFAICS) is that there is a potential race between
>>> LAYOUTGET and CB_LAYOUTRECALL:
>>>
>>> The lease is set very early in the LAYOUTGET process, and it can be
>>> broken at any time beyond that point, even before LAYOUTGET is done and
>>> has populated the ls_layouts list. If __break_lease gets called before
>>> the list is populated, then the recall won't be sent (because ls_layouts
>>> is still empty), but the LAYOUTGET will still complete successfully.
>>>
>>> I think we need a check at the end of nfsd4_layoutget, after the
>>> nfsd4_insert_layout call to see whether the lease has been broken. If it
>>> has, then we should unwind everything and return NFS4ERR_RECALLCONFLICT.
>>
>> Shall I drop this fix from nfsd-next, then?
>>
>
> No, I think Ben's fix is still valid. The problem I'm seeing is a
> different issue in the same area of the code. A follow-on patch to
> address that is appropriate.

Thanks for clarifying... I wasn't sure whether y'all were planning
a replacement patch or an addendum. Sounds like the latter, so I'll
leave Ben's fix on the queue.

--
Chuck Lever
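
For anyone following the race Jeff describes above, here is a minimal user-space sketch in C of the proposed check at the end of LAYOUTGET. It is a toy model under stated assumptions, not the nfsd code: the struct, the `model_*` functions, the `lease_broken` flag, the counters, and the status enum are all hypothetical names made up for illustration. Only `ls_recalled` and the notion of an `ls_layouts` list come from the thread, and the real code would need locking that this model ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes for the model; not the NFSv4 wire values. */
enum model_status { MODEL_OK = 0, MODEL_RECALLCONFLICT = 1 };

/* Toy stand-in for a layout stateid. Every field except ls_recalled is
 * an assumption introduced for this sketch. */
struct model_layout_stateid {
	bool lease_set;     /* lease installed early in LAYOUTGET */
	bool lease_broken;  /* hypothetical: a break arrived for the lease */
	bool ls_recalled;   /* a recall callback has been issued */
	int  nr_layouts;    /* stands in for the length of ls_layouts */
	int  recalls_sent;  /* counts modelled recall callbacks */
};

/* Early part of the modelled LAYOUTGET: the lease exists, the layout
 * list is still empty. This is the window in which the race can occur. */
static void model_layoutget_begin(struct model_layout_stateid *ls)
{
	assert(ls != NULL);
	ls->lease_set = true;
}

/* Modelled lease break. As in the fix under discussion, ls_recalled is
 * only set when there is actually a layout to recall. The break itself
 * is remembered so that LAYOUTGET can notice it later. */
static void model_break_lease(struct model_layout_stateid *ls)
{
	assert(ls != NULL);
	if (!ls->lease_set)
		return;
	ls->lease_broken = true;
	if (ls->nr_layouts == 0)
		return;               /* nothing to recall yet */
	if (!ls->ls_recalled) {
		ls->ls_recalled = true;
		ls->recalls_sent++;
	}
}

/* Final part of the modelled LAYOUTGET: insert the layout, then check
 * whether the lease was broken in the meantime. If so, unwind the
 * insert and report a recall conflict instead of succeeding. */
static enum model_status model_layoutget_finish(struct model_layout_stateid *ls)
{
	assert(ls != NULL);
	ls->nr_layouts++;             /* stands in for the insert step */
	if (ls->lease_broken) {
		ls->nr_layouts--;     /* unwind */
		return MODEL_RECALLCONFLICT;
	}
	return MODEL_OK;
}
```

In this model, a break that lands between `model_layoutget_begin()` and `model_layoutget_finish()` sends no recall but makes the LAYOUTGET fail with the conflict status, while a break that arrives after a successful finish sends exactly one recall.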