Re: [PATCH RFC] nfsd: serialize layout stateid morphing operations

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 1 Dec 2015 12:56:00 +0100
Christoph Hellwig <hch@xxxxxx> wrote:

> On Mon, Nov 30, 2015 at 07:33:13PM -0500, Jeff Layton wrote:
> > On Mon, 30 Nov 2015 16:34:20 -0500
> > "J. Bruce Fields" <bfields@xxxxxxxxxxxx> wrote:
> > 
> > > On Mon, Nov 30, 2015 at 10:57:07AM +0800, Kinglong Mee wrote:
> > > > On 11/29/2015 21:46, Jeff Layton wrote:
> > > > > On Sun, 29 Nov 2015 12:07:48 +0800
> > > > > Kinglong Mee <kinglongmee@xxxxxxxxx> wrote:
> > > > > 
> > > > >> I meet two problems with this patch,
> > > > >>
> > > > > 
> > > > > Ok, I was less sure of this patch than the other one. I think we will
> > > > > need to serialize these operations, but this may not be the best way to
> > > > > do it.
> > > > > 
> > > > > Bruce if you want to back this one for now, then I'm fine with that. It
> > > > > may be a bit before I can get to it.
> > > > > 
> > > > > 
> > > > >> 1. a dump_stack messages printed in process_one_work() at kernel/workqueue.c.
> > > > >>
> > > > >> BUG: workqueue leaked lock or atomic: kworker/u2:4/0x00000000/98#012     last function: nfsd4_run_cb_work [nfsd]
> > > > >> 1 lock held by kworker/u2:4/98:
> > > > >> #0:  (&ls->ls_mutex){......}, at: [<ffffffffa0250d34>] nfsd4_cb_layout_prepare+0x24/0x40 [nfsd]
> > > > >> CPU: 0 PID: 98 Comm: kworker/u2:4 Not tainted 4.4.0-rc2+ #333
> > > > >> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> > > > >> Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
> > > > >> ffff8800362b9e40 000000007fe9394f ffff880036353d58 ffffffff8136dc64
> > > > >> ffff880036353dd8 ffffffff810a3f12 ffffffff810a3cbd 000000000000000a
> > > > >> ffffffffa0261d78 ffffffff82902e20 0000000000000000 ffffffffa0259241
> > > > >> Call Trace:
> > > > >> [<ffffffff8136dc64>] dump_stack+0x19/0x25
> > > > >> [<ffffffff810a3f12>] process_one_work+0x3c2/0x4c0
> > > > >> [<ffffffff810a3cbd>] ? process_one_work+0x16d/0x4c0
> > > > >> [<ffffffff810a405a>] worker_thread+0x4a/0x440
> > > > >> [<ffffffff810a4010>] ? process_one_work+0x4c0/0x4c0
> > > > >> [<ffffffff810a4010>] ? process_one_work+0x4c0/0x4c0
> > > > >> [<ffffffff810a91e5>] kthread+0xf5/0x110
> > > > >> [<ffffffff810a90f0>] ? kthread_create_on_node+0x240/0x240
> > > > >> [<ffffffff81738d0f>] ret_from_fork+0x3f/0x70
> > > > >> [<ffffffff810a90f0>] ? kthread_create_on_node+0x240/0x240
> > > > >>
> > > > > 
> > > > > 
> > > > > I guess this is just a general "hey you left a mutex locked" warning
> > > > > when finishing a workqueue job? This patch actually does want to do
> > > > > that, but I trying to tell that to lockdep may be tricky...
> > > > 
> > > > Yes, it's just a warning.
> > > > But, it's terrible that kernel prints it every time
> > > > when calling nfsd4_cb_layout_prepare.
> > > > 
> > > > > 
> > > > > 
> > > > >> 2. a mutex race between layoutrecall and layoutcommit, 
> > > > >>
> > > > >> callback thread,
> > > > >> nfsd4_cb_layout_prepare 
> > > > >>   --->mutex_lock(&ls->ls_mutex);
> > > > >>
> > > > >> layoutcommit thread,
> > > > >> nfsd4_layoutcommit
> > > > >>   ---> nfsd4_preprocess_layout_stateid
> > > > >>      --> mutex_lock(&ls->ls_mutex);   <----------------  hang 
> > > > >>
> > > > >> [  600.645142] INFO: task nfsd:11623 blocked for more than 120 seconds.
> > > > >> [  600.646337]       Not tainted 4.4.0-rc2+ #332
> > > > >> [  600.647404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > >> [  600.648546] nfsd            D ffff880064277b80     0 11623      2 0x00000000
> > > > >> [  600.649803]  ffff880064277b80 ffff880064278000 00000000ffffffff ffff88005dd241a8
> > > > >> [  600.651021]  ffffffffa025e77c 0000000000000246 ffff880064277b98 ffffffff81733103
> > > > >> [  600.652255]  ffff880063d7e100 ffff880064277ba8 ffffffff8173330e ffff880064277c28
> > > > >> [  600.653512] Call Trace:
> > > > >> [  600.654765]  [<ffffffffa025e77c>] ? nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
> > > > >> [  600.656084]  [<ffffffff81733103>] schedule+0x33/0x80
> > > > >> [  600.657405]  [<ffffffff8173330e>] schedule_preempt_disabled+0xe/0x10
> > > > >> [  600.658741]  [<ffffffff817357f5>] mutex_lock_nested+0x145/0x330
> > > > >> [  600.660094]  [<ffffffffa025e77c>] ? nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
> > > > >> [  600.661696]  [<ffffffffa025e77c>] nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
> > > > >> [  600.663129]  [<ffffffffa025e405>] ? nfsd4_preprocess_layout_stateid+0x5/0x400 [nfsd]
> > > > >> [  600.664558]  [<ffffffff81173b8f>] ? printk+0x56/0x72
> > > > >> [  600.665990]  [<ffffffffa023e3ec>] nfsd4_layoutcommit+0x13c/0x200 [nfsd]
> > > > >> [  600.667365]  [<ffffffffa023fb98>] nfsd4_proc_compound+0x388/0x660 [nfsd]
> > > > >> [  600.668835]  [<ffffffffa022c148>] nfsd_dispatch+0xb8/0x200 [nfsd]
> > > > >> [  600.670323]  [<ffffffffa0093d89>] svc_process_common+0x409/0x650 [sunrpc]
> > > > >> [  600.671836]  [<ffffffffa0094e04>] svc_process+0xf4/0x190 [sunrpc]
> > > > >> [  600.673328]  [<ffffffffa022bb05>] nfsd+0x135/0x1a0 [nfsd]
> > > > >> [  600.674825]  [<ffffffffa022b9d5>] ? nfsd+0x5/0x1a0 [nfsd]
> > > > >> [  600.676388]  [<ffffffffa022b9d0>] ? nfsd_destroy+0xb0/0xb0 [nfsd]
> > > > >> [  600.677884]  [<ffffffff810a9175>] kthread+0xf5/0x110
> > > > >> [  600.679373]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
> > > > >> [  600.680874]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
> > > > >> [  600.682398]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
> > > > >> [  600.683893] 1 lock held by nfsd/11623:
> > > > >> [  600.685449]  #0:  (&ls->ls_mutex){......}, at: [<ffffffffa025e77c>] nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
> > > > >> [  600.688778] Sending NMI to all CPUs:
> > > > >> [  600.690854] NMI backtrace for cpu 0
> > > > >> [  600.691909] CPU: 0 PID: 11 Comm: khungtaskd Not tainted 4.4.0-rc2+ #332
> > > > >> [  600.692523] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> > > > >> [  600.693821] task: ffff88007b900000 ti: ffff88007b8fc000 task.ti: ffff88007b8fc000
> > > > >> [  600.694496] RIP: 0010:[<ffffffff81053aca>]  [<ffffffff81053aca>] native_write_msr_safe+0xa/0x10
> > > > >> [  600.695185] RSP: 0018:ffff88007b8ffd70  EFLAGS: 00000046
> > > > >> [  600.695861] RAX: 0000000000000400 RBX: 0000000000000286 RCX: 0000000000000830
> > > > >> [  600.696539] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000830
> > > > >> [  600.697204] RBP: ffff88007b8ffd70 R08: 0000000000000400 R09: 0000000000000000
> > > > >> [  600.697862] R10: 00000000000000e4 R11: 0000000000000001 R12: ffff880063d7e100
> > > > >> [  600.698513] R13: 00000000003fff3c R14: ffff880063d7e308 R15: 0000000000000004
> > > > >> [  600.699156] FS:  0000000000000000(0000) GS:ffffffff81c27000(0000) knlGS:0000000000000000
> > > > >> [  600.699823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > >> [  600.700459] CR2: 00007f366afd4000 CR3: 000000005dd56000 CR4: 00000000001406f0
> > > > >> [  600.701106] Stack:
> > > > >> [  600.701745]  ffff88007b8ffd88 ffffffff8104a860 ffffffff81047340 ffff88007b8ffd98
> > > > >> [  600.702404]  ffffffff8104a885 ffff88007b8ffda8 ffffffff8104735b ffff88007b8ffdd8
> > > > >> [  600.703058]  ffffffff813723ad 0000000000000078 ffff880063d7e100 00000000003fff3c
> > > > >> [  600.703712] Call Trace:
> > > > >> [  600.704355]  [<ffffffff8104a860>] __x2apic_send_IPI_mask.isra.2+0x60/0x70
> > > > >> [  600.705017]  [<ffffffff81047340>] ? setup_vector_irq+0x130/0x130
> > > > >> [  600.705676]  [<ffffffff8104a885>] x2apic_send_IPI_mask+0x15/0x20
> > > > >> [  600.706335]  [<ffffffff8104735b>] nmi_raise_cpu_backtrace+0x1b/0x20
> > > > >> [  600.706989]  [<ffffffff813723ad>] nmi_trigger_all_cpu_backtrace+0x14d/0x1c0
> > > > >> [  600.707693]  [<ffffffff810473b9>] arch_trigger_all_cpu_backtrace+0x19/0x20
> > > > >> [  600.708362]  [<ffffffff8112c4cf>] watchdog+0x32f/0x370
> > > > >> [  600.709031]  [<ffffffff8112c221>] ? watchdog+0x81/0x370
> > > > >> [  600.709725]  [<ffffffff8112c1a0>] ? reset_hung_task_detector+0x20/0x20
> > > > >> [  600.710398]  [<ffffffff810a9175>] kthread+0xf5/0x110
> > > > >> [  600.711067]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
> > > > >> [  600.711739]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
> > > > >> [  600.712405]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
> > > > >> [  600.713073] Code: 00 55 89 f9 48 89 e5 0f 32 45 31 c0 48 c1 e2 20 44 89 06 48 09 d0 5d c3 66 0f 1f 84 00 00 00 00 00 55 89 f0 89 f9 48 89 e5 0f 30 <31> c0 5d c3 66 90 55 89 f9 48 89 e5 0f 33 48 c1 e2 20 48 09 d0
> > > > >> [  600.715196] Kernel panic - not syncing: hung_task: blocked tasks
> > > > >> [  600.715889] CPU: 0 PID: 11 Comm: khungtaskd Not tainted 4.4.0-rc2+ #332
> > > > >> [  600.716540] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> > > > >> [  600.717910]  ffffffff81a4d19c 00000000480750be ffff88007b8ffd60 ffffffff8136dbf4
> > > > >> [  600.718610]  ffff88007b8ffde8 ffffffff81173559 ffff880000000008 ffff88007b8ffdf8
> > > > >> [  600.719302]  ffff88007b8ffd90 00000000480750be 0000000000000001 0000000000000001
> > > > >> [  600.719984] Call Trace:
> > > > >> [  600.720646]  [<ffffffff8136dbf4>] dump_stack+0x19/0x25
> > > > >> [  600.721330]  [<ffffffff81173559>] panic+0xd3/0x212
> > > > >> [  600.722009]  [<ffffffff8112c4db>] watchdog+0x33b/0x370
> > > > >> [  600.722686]  [<ffffffff8112c221>] ? watchdog+0x81/0x370
> > > > >> [  600.723213]  [<ffffffff8112c1a0>] ? reset_hung_task_detector+0x20/0x20
> > > > >> [  600.723674]  [<ffffffff810a9175>] kthread+0xf5/0x110
> > > > >> [  600.724107]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
> > > > >> [  600.724509]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
> > > > >> [  600.724903]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
> > > > >>
> > > > >> thanks,
> > > > >> Kinglong Mee
> > > > >>
> > > > > 
> > > > > This is the bigger problem, I think. The question of course is why the
> > > > > client didn't respond to the cb request? Still, holding a mutex across
> > > > > the callback RPC is pretty ugly and now that I think about it, I don't
> > > > > think it's really necessary anyway. Once we've copied the stateid, we
> > > > > aren't really changing anything else so we can let other layout ops
> > > > > proceed.
> > > > > 
> > > > > Would a patch like this fix it?
> > > > 
> > > > Yes, it's great.
> > > > With this patch, every thing is okay.
> > > > 
> > > > > 
> > > > > -----------------------8<--------------------------
> > > > > 
> > > > > [PATCH] nfsd: don't hold ls_mutex across a layout recall
> > > > > 
> > > > > We do need to serialize layout stateid morphing operations, but we
> > > > > currently hold the ls_mutex across a layout recall which is pretty
> > > > > ugly. It's also unnecessary -- once we've bumped the seqid and
> > > > > copied it, we don't need to serialize the rest of the CB_LAYOUTRECALL
> > > > > vs. anything else. Just drop the mutex once the copy is done.
> > > > > 
> > > > > Fixes: cc8a55320b5f "nfsd: serialize layout stateid morphing operations"
> > > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > > Reported-by: Kinglong Mee <kinglongmee@xxxxxxxxx>
> > > > > Signed-off-by: Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx>
> > > > 
> > > > Tested-by: Kinglong Mee <kinglongmee@xxxxxxxxx>
> > > 
> > > Thanks for the report and the testing.
> > > 
> > > It may be worth applying just as a stopgap, but I'm not convinced this
> > > is right yet....
> > > 
> > > I guess I'd expect the stateid bump to be atomic with the actual change
> > > to the layout state.  We're bumping the layout stateid on sending the
> > > callback, and modifying the actual layout state on receive.  So after
> > > the below patch we've left a window during which nfsd threads see the
> > > new layout stateid but the old layout state.  I wonder what the
> > > practical consequences of that are?
> > > 
> > > My temptation is to argue that the layout processing currently done in
> > > the release method should be done on sending (doesn't look like it's
> > > conditional on the result of the callback, so what are we waiting for?).
> > > But I'm not sure that's right.  I'll go read the spec....
> > > 
> > > --b.
> > > 
> > 
> > Agreed. The basic rule (AIUI) is that if you change anything in the
> > layout then you should bump the seqid.
> > 
> > But... is the layout processing in that release method supposed to be
> > done at all in CB_LAYOUTRECALL? The spec says:
> > 
> > -------------------8<-----------------------
> >    The client's processing for CB_LAYOUTRECALL is similar to CB_RECALL
> >    (recall of file delegations) in that the client responds to the
> >    request before actually returning layouts via the LAYOUTRETURN
> >    operation.  While the client responds to the CB_LAYOUTRECALL
> >    immediately, the operation is not considered complete (i.e.,
> >    considered pending) until all affected layouts are returned to the
> >    server via the LAYOUTRETURN operation
> > -------------------8<-----------------------
> > 
> > It doesn't seem like we ought to be tearing down layouts at that
> > point, but rather just notifying the client that they should be
> > returned. Revoking the layout while the client still might have it in
> > use seems like it could be problematic. Is there something I'm missing
> > there?
> 
> Yes.  My impression based on the Linux client and the delgation
> code was that we'd expect the client to either return
> NFS4ERR_NOMATCHING_LAYOUT, in which case the server can reclaim the
> layouts, or NFS4ERR_DELAY if it needs more time to process the layouts.
> 
> But for non-forgetful clients I wonder if returning 0 should be
> interpreted the same as NFS4ERR_DELAY?  Note that we still need to
> time out the client if it doesn't respond in time, so NFS4ERR_DELAY
> seems better than 0, but the standard doesn't really talk about
> return values other than NFS4ERR_NOMATCHING_LAYOUT.

My interpretation is somewhat different. To me, this is how we'd
interpret the response from the client (pseudocode):

NFS_OK:
	/* Message received. I'll start returning these layouts soon. */
NFS4ERR_DELAY:
	/* I'm too resource constrained to even process this simple
           request right now. Please ask me again in a bit. */
NFS4ERR_NOMATCHING_LAYOUT:
	/* Huh? What layout? */

...IMO, the spec is pretty clear that a successful response from the
client just means that it got the message that it should start
returning layouts. If it happens to return anything before the cb
response, then that's just luck/coincidence. The server shouldn't count
on that.

-- 
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux