Re: task hung on target unload

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Tue, 28 Jan 2014 18:33:10 -0800

On Tue, 2014-01-28 at 11:57 -0800, Nicholas A. Bellinger wrote:
> Hi Tommy,
> 
> On Sat, 2014-01-25 at 15:19 +0100, Tommy Apel wrote:
> > Hello, after disconnecting my srp initiator and trying to shut down
> > the target I end up with a hung/stale system, I have experienced this
> > on both 3.13.0 and 3.10.25
> > 
> > Here is the dmesg
> > 
> > [170319.904119] Received DREQ and sent DREP for session 0x00000000000000000002c9030005566e.
> > [170321.960898] Received IB TimeWait exit for cm_id ffff88046c72da00.
> > [170321.960993] Session 0x00000000000000000002c9030005566e: kernel thread ib_srpt_compl (PID 9208) stopped
> > [170564.275488] INFO: task tcm_fabric:14473 blocked for more than 120 seconds.
> > [170564.275491]       Not tainted 3.13.0-gentoo-r1 #1
> > [170564.275492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [170564.275493] tcm_fabric      D ffff88047fdd2d00     0 14473  14444 0x00000004
> > [170564.275496]  ffff880372806790 0000000000000002 ffff88046f2e9810 ffff880406b5449f
> > [170564.275499]  0000000000012d00 ffff880395b35fd8 0000000000012d00 ffff880372806790
> > [170564.275501]  ffff880406b54110 ffff880312b94420 ffff880312b94428 7fffffffffffffff
> > [170564.275503] Call Trace:
> > [170564.275510]  [<ffffffff817335ba>] ? schedule_timeout+0x17a/0x1e0
> > [170564.275514]  [<ffffffff810923a3>] ? enqueue_task_fair+0x1b3/0xa90
> > [170564.275516]  [<ffffffff8173507d>] ? wait_for_completion+0x9d/0x110
> > [170564.275519]  [<ffffffff8108c790>] ? try_to_wake_up+0x280/0x280
> > [170564.275527]  [<ffffffffa01129f6>] ? transport_clear_lun_ref+0x46/0x70 [target_core_mod]
> > [170564.275532]  [<ffffffffa010d687>] ? core_tpg_post_dellun+0x27/0x60 [target_core_mod]
> > [170564.275537]  [<ffffffffa00ffc65>] ? core_dev_del_lun+0x35/0xb0 [target_core_mod]
> > [170564.275542]  [<ffffffffa01018e3>] ? target_fabric_port_unlink+0x43/0x60 [target_core_mod]
> > [170564.275545]  [<ffffffff811c870e>] ? configfs_unlink+0xee/0x1c0
> > [170564.275549]  [<ffffffff8116331a>] ? vfs_unlink+0xda/0x160
> > [170564.275551]  [<ffffffff811635ce>] ? do_unlinkat+0x22e/0x260
> > [170564.275554]  [<ffffffff810105d5>] ? syscall_trace_enter+0x115/0x1c0
> > [170564.275557]  [<ffffffff81737ae1>] ? tracesys+0xd4/0xd9
> 
> Thanks for reporting.
> 
> So starting with v3.13 code, this particular logic has been changed to
> use percpu refcounting.  I'm able to reproduce a similar issue with a
> different fabric driver, and currently in the process of tracking this
> bug down.
> 

OK, I just posted the following patch to address this regression in
v3.13.  I've been able to verify the fix using vhost/scsi, and am pretty
certain you're hitting the same bug with the ib_srpt code.

target: Fix percpu_ref_put race in transport_lun_remove_cmd
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=7769401d351d54d5cbcb6400ec60c0b916e87a7e

Note this patch has been queued up for mainline as it does fix the bug
I've been hitting recently, but please go ahead and verify it addresses
your LUN shutdown issue with ib_srpt as well.
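
For anyone following along, here is a rough userspace sketch of the
general idea behind guarding a reference drop so it happens at most once
per command.  It is not the patch itself, and it assumes the race is two
code paths both dropping the LUN reference for the same command.  The
names demo_lun, demo_cmd, demo_cmd_get_lun and demo_cmd_put_lun are
hypothetical, and a plain C11 atomic counter stands in for the kernel's
percpu refcount:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct demo_lun {
    atomic_int ref;             /* outstanding command references on the LUN */
};

struct demo_cmd {
    struct demo_lun *lun;
    atomic_int lun_ref_active;  /* 1 while this command holds a LUN reference */
};

/* Take one LUN reference on behalf of the command and mark it as held. */
static void demo_cmd_get_lun(struct demo_cmd *cmd, struct demo_lun *lun)
{
    cmd->lun = lun;
    atomic_fetch_add(&lun->ref, 1);
    atomic_store(&cmd->lun_ref_active, 1);
}

/*
 * Drop the command's LUN reference at most once.  The compare-and-swap
 * flips lun_ref_active from 1 to 0 for exactly one caller, so a second
 * caller (for example a competing completion or abort path) sees 0 and
 * returns without touching the LUN counter.
 * Returns true only for the caller that actually dropped the reference.
 */
static bool demo_cmd_put_lun(struct demo_cmd *cmd)
{
    int expected = 1;

    if (!atomic_compare_exchange_strong(&cmd->lun_ref_active, &expected, 0))
        return false;

    int old = atomic_fetch_sub(&cmd->lun->ref, 1);
    assert(old > 0);            /* the counter must never underflow */
    return true;
}
```

Without a guard like this, two paths putting the same reference would
leave the counter unbalanced, which is one way a waiter on the final
release could end up misbehaving during LUN unlink.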

Thanks again,

--nab
