> On Aug 11, 2016, at 12:06, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Thu, 2016-08-11 at 15:55 +0000, Trond Myklebust wrote:
>>>
>>> On Aug 11, 2016, at 11:23, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>
>>> I was playing around with the in-kernel flexfiles server today, and I
>>> seem to be hitting a deadlock when using it on an XFS-exported
>>> filesystem. Here's the stack trace of how the CB_LAYOUTRECALL occurs:
>>>
>>> [  928.736139] CPU: 0 PID: 846 Comm: nfsd Tainted: G           OE   4.8.0-rc1+ #3
>>> [  928.737040] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
>>> [  928.738009]  0000000000000286 000000006125f50e ffff91153845b878 ffffffff8f463853
>>> [  928.738906]  ffff91152ec194d0 ffff91152d31d9c0 ffff91153845b8a8 ffffffffc045936f
>>> [  928.739788]  ffff91152c051980 ffff91152d31d9c0 ffff91152c051540 ffff9115361b8a58
>>> [  928.740697] Call Trace:
>>> [  928.740998]  [<ffffffff8f463853>] dump_stack+0x86/0xc3
>>> [  928.741570]  [<ffffffffc045936f>] nfsd4_recall_file_layout+0x17f/0x190 [nfsd]
>>> [  928.742380]  [<ffffffffc045939d>] nfsd4_layout_lm_break+0x1d/0x30 [nfsd]
>>> [  928.743115]  [<ffffffff8f3056d8>] __break_lease+0x118/0x6a0
>>> [  928.743759]  [<ffffffffc02dea69>] xfs_break_layouts+0x79/0x120 [xfs]
>>> [  928.744462]  [<ffffffffc029ea04>] xfs_file_aio_write_checks+0x94/0x1f0 [xfs]
>>> [  928.745251]  [<ffffffffc029f36b>] xfs_file_buffered_aio_write+0x7b/0x330 [xfs]
>>> [  928.746063]  [<ffffffffc029f70c>] xfs_file_write_iter+0xec/0x140 [xfs]
>>> [  928.746803]  [<ffffffff8f2a0599>] do_iter_readv_writev+0xb9/0x140
>>> [  928.747478]  [<ffffffff8f2a126b>] do_readv_writev+0x19b/0x240
>>> [  928.748146]  [<ffffffffc029f620>] ? xfs_file_buffered_aio_write+0x330/0x330 [xfs]
>>> [  928.748956]  [<ffffffff8f29e02b>] ? do_dentry_open+0x28b/0x310
>>> [  928.749614]  [<ffffffffc029c800>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
>>> [  928.750367]  [<ffffffff8f2a156f>] vfs_writev+0x3f/0x50
>>> [  928.750934]  [<ffffffffc04276ca>] nfsd_vfs_write+0xca/0x3a0 [nfsd]
>>> [  928.751608]  [<ffffffffc0429ec5>] nfsd_write+0x485/0x780 [nfsd]
>>> [  928.752263]  [<ffffffffc043144c>] nfsd3_proc_write+0xbc/0x150 [nfsd]
>>> [  928.752973]  [<ffffffffc0421388>] nfsd_dispatch+0xb8/0x1f0 [nfsd]
>>> [  928.753642]  [<ffffffffc036d78f>] svc_process_common+0x42f/0x690 [sunrpc]
>>> [  928.754395]  [<ffffffffc036e8e8>] svc_process+0x118/0x330 [sunrpc]
>>> [  928.755080]  [<ffffffffc04208ac>] nfsd+0x19c/0x2b0 [nfsd]
>>> [  928.755681]  [<ffffffffc0420715>] ? nfsd+0x5/0x2b0 [nfsd]
>>> [  928.756274]  [<ffffffffc0420710>] ? nfsd_destroy+0x190/0x190 [nfsd]
>>> [  928.756991]  [<ffffffff8f0d5891>] kthread+0x101/0x120
>>> [  928.757563]  [<ffffffff8f10dcc5>] ? trace_hardirqs_on_caller+0xf5/0x1b0
>>> [  928.758282]  [<ffffffff8f8f2fef>] ret_from_fork+0x1f/0x40
>>> [  928.758875]  [<ffffffff8f0d5790>] ? kthread_create_on_node+0x250/0x250
>>>
>>>
>>> So the client gets a flexfiles layout, and then tries to issue a v3
>>> WRITE against the file. XFS then recalls the layout, but the client
>>> can't return the layout until the v3 WRITE completes. Eventually this
>>> should resolve itself after 2 lease periods, but that's quite a long
>>> time.
>>
>> What’s the sequence of operations here? If the client has outstanding
>> I/O, I should now be returning NFS_OK, and then completing the recall
>> with a LAYOUTRETURN as soon as the outstanding I/O (and layoutcommit,
>> if one is due) is done.
>>
>> The server is expected to return NFS4ERR_RECALLCONFLICT to any
>> LAYOUTGET attempts that occur before the LAYOUTRETURN.
>>
>
> Basically, I'm just doing this on the client:
>
> $ echo "foo" > /mnt/knfsdsrv/testfile
>
> The client does:
>
> OPEN
> LAYOUTGET (for RW)
> GETDEVICEINFO
>
> ...and then a v3 WRITE under the aegis of the layout it got.
>
> The server then issues a CB_LAYOUTRECALL (because XFS wants to do that
> whenever there is a local write, apparently). The client returns
> NFS_OK, but it can't return the layout until the v3 WRITE completes.
> The v3 write is hung though because it's waiting for the layout to be
> returned.

Oh… So this is an artifact of the write being local, and XFS having a
path to recall the layout that it really shouldn’t have in the
flexfiles case?

>
>>>
>>> I guess XFS requires recalling block and SCSI layouts when the server
>>> wants to issue a write (or someone writes to it locally), but that
>>> seems like it shouldn't be happening when the layout is a flexfiles
>>> layout.
>>>
>>> Any thoughts on what the right fix is here?
>>>
>>> On a related note, knfsd will spam the heck out of the client with
>>> CB_LAYOUTRECALLs during this time. I think we ought to consider
>>> fixing the server not to treat an NFS_OK return from the client like
>>> NFS4ERR_DELAY there, but that would mean a different mechanism for
>>> timing out a CB_LAYOUTRECALL.
>>
>> There is a big difference between NFS_OK and NFS4ERR_DELAY as far as
>> the server is concerned:
>>
>> - NFS_OK means that the client has now seen the stateid with the
>>   updated sequence id that was sent in CB_LAYOUTRECALL, and is
>>   processing it. No resend of the CB_LAYOUTRECALL is required.
>> - OTOH, NFS4ERR_DELAY means the same thing in the back channel as it
>>   does in the forward channel: I’m busy and cannot process your
>>   request, please resend it later.
>
> Right. The current code basically just treats them the same, as a
> mechanism to handle eventually timing out the layoutrecall. The extra
> CB_LAYOUTRECALLs are entirely superfluous. It's probably not too hard
> to fix, but we'd need to come up with some other mechanism for timing
> out the layoutrecall.
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>