On Sat, 2018-01-27 at 10:39 -0500, Benjamin Coddington wrote:
> On 11 Aug 2016, at 11:23, Jeff Layton wrote:
> 
> > I was playing around with the in-kernel flexfiles server today, and I
> > seem to be hitting a deadlock when using it on an XFS-exported
> > filesystem. Here's the stack trace of how the CB_LAYOUTRECALL occurs:
> > 
> > [  928.736139] CPU: 0 PID: 846 Comm: nfsd Tainted: G OE 4.8.0-rc1+ #3
> > [  928.737040] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
> > [  928.738009] 0000000000000286 000000006125f50e ffff91153845b878 ffffffff8f463853
> > [  928.738906] ffff91152ec194d0 ffff91152d31d9c0 ffff91153845b8a8 ffffffffc045936f
> > [  928.739788] ffff91152c051980 ffff91152d31d9c0 ffff91152c051540 ffff9115361b8a58
> > [  928.740697] Call Trace:
> > [  928.740998] [<ffffffff8f463853>] dump_stack+0x86/0xc3
> > [  928.741570] [<ffffffffc045936f>] nfsd4_recall_file_layout+0x17f/0x190 [nfsd]
> > [  928.742380] [<ffffffffc045939d>] nfsd4_layout_lm_break+0x1d/0x30 [nfsd]
> > [  928.743115] [<ffffffff8f3056d8>] __break_lease+0x118/0x6a0
> > [  928.743759] [<ffffffffc02dea69>] xfs_break_layouts+0x79/0x120 [xfs]
> > [  928.744462] [<ffffffffc029ea04>] xfs_file_aio_write_checks+0x94/0x1f0 [xfs]
> > [  928.745251] [<ffffffffc029f36b>] xfs_file_buffered_aio_write+0x7b/0x330 [xfs]
> > [  928.746063] [<ffffffffc029f70c>] xfs_file_write_iter+0xec/0x140 [xfs]
> > [  928.746803] [<ffffffff8f2a0599>] do_iter_readv_writev+0xb9/0x140
> > [  928.747478] [<ffffffff8f2a126b>] do_readv_writev+0x19b/0x240
> > [  928.748146] [<ffffffffc029f620>] ? xfs_file_buffered_aio_write+0x330/0x330 [xfs]
> > [  928.748956] [<ffffffff8f29e02b>] ? do_dentry_open+0x28b/0x310
> > [  928.749614] [<ffffffffc029c800>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
> > [  928.750367] [<ffffffff8f2a156f>] vfs_writev+0x3f/0x50
> > [  928.750934] [<ffffffffc04276ca>] nfsd_vfs_write+0xca/0x3a0 [nfsd]
> > [  928.751608] [<ffffffffc0429ec5>] nfsd_write+0x485/0x780 [nfsd]
> > [  928.752263] [<ffffffffc043144c>] nfsd3_proc_write+0xbc/0x150 [nfsd]
> > [  928.752973] [<ffffffffc0421388>] nfsd_dispatch+0xb8/0x1f0 [nfsd]
> > [  928.753642] [<ffffffffc036d78f>] svc_process_common+0x42f/0x690 [sunrpc]
> > [  928.754395] [<ffffffffc036e8e8>] svc_process+0x118/0x330 [sunrpc]
> > [  928.755080] [<ffffffffc04208ac>] nfsd+0x19c/0x2b0 [nfsd]
> > [  928.755681] [<ffffffffc0420715>] ? nfsd+0x5/0x2b0 [nfsd]
> > [  928.756274] [<ffffffffc0420710>] ? nfsd_destroy+0x190/0x190 [nfsd]
> > [  928.756991] [<ffffffff8f0d5891>] kthread+0x101/0x120
> > [  928.757563] [<ffffffff8f10dcc5>] ? trace_hardirqs_on_caller+0xf5/0x1b0
> > [  928.758282] [<ffffffff8f8f2fef>] ret_from_fork+0x1f/0x40
> > [  928.758875] [<ffffffff8f0d5790>] ? kthread_create_on_node+0x250/0x250
> > 
> > So the client gets a flexfiles layout, and then tries to issue a v3
> > WRITE against the file. XFS then recalls the layout, but the client
> > can't return the layout until the v3 WRITE completes. Eventually this
> > should resolve itself after two lease periods, but that's quite a long
> > time.
> > 
> > I guess XFS requires recalling block and SCSI layouts when the server
> > wants to issue a write (or someone writes to it locally), but that
> > seems like it shouldn't be happening when the layout is a flexfiles
> > layout.
> > 
> > Any thoughts on what the right fix is here?
> > 
> > On a related note, knfsd will spam the heck out of the client with
> > CB_LAYOUTRECALLs during this time. I think we ought to consider fixing
> > the server not to treat an NFS_OK return from the client like
> > NFS4ERR_DELAY there, but that would mean a different mechanism for
> > timing out a CB_LAYOUTRECALL.
> I'm getting into similar trouble with SCSI layouts when the client ends up
> submitting a WRITE because the IO is not page aligned, but it already holds
> a layout for that range. It looks like the server sends a CB_LAYOUTRECALL,
> but the client has to answer NFS4ERR_DELAY because it is still holding the
> layout.
> 
> Probably, the client should return any layouts it holds for that range
> before doing IO through the MDS.

Yes, that might be good. The client could even prefix the WRITE compound
with a LAYOUTRETURN if you want to get fancy. :)

> Alternatively, shouldn't the MDS accept IO from the same client that
> holds a layout for that range, rather than recall that layout? RFC 5661
> section 20.3.4 talks about the client submitting WRITEs before responding
> to CB_LAYOUTRECALL: "As always, the client may write the data through the
> metadata server."

Agreed. That seems reasonable too.

> I'm trying to find the discussion that resulted in this commit:
> 
>     commit 6b9b21073d3b250e17812cd562fffc9006962b39
>     Author: Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
>     Date:   Tue Dec 8 07:23:48 2015 -0500
> 
>         nfsd: give up on CB_LAYOUTRECALLs after two lease periods
> 
> Why should we poll the client if the client answers with NFS4ERR_DELAY?
> Can we instead just wait for the layout to be returned?

No. NFS4ERR_DELAY just means "I'm too busy to answer right now, please
call again later". You can't infer that the client has made any note of
the CB_LAYOUTRECALL at all, since the callback didn't succeed.

Returning NFS4_OK to a CB_LAYOUTRECALL just means that the client
acknowledges that the layout has been recalled and will eventually send a
LAYOUTRETURN. It doesn't mean that the layout is being returned
immediately.

Probably what the client should do in this situation is mark the layout
as having been recalled and return NFS4_OK instead of NFS4ERR_DELAY. It
seems like that ought to be possible, but I haven't looked at the code to
see why that isn't occurring.
> Also, I think the 2*lease period timeout is currently broken because we
> reset tk_start after every call... but that's not really causing any
> trouble.

It'd be good to fix that too, since you're in there...
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html