Re: task hung on target unload

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Tue, 28 Jan 2014 11:57:43 -0800

Hi Tommy,

On Sat, 2014-01-25 at 15:19 +0100, Tommy Apel wrote:
> Hello, after disconnecting my SRP initiator and trying to shut down
> the target, I end up with a hung/stale system. I have experienced this
> on both 3.13.0 and 3.10.25.
> 
> Here is the dmesg
> 
> [170319.904119] Received DREQ and sent DREP for session 0x00000000000000000002c9030005566e.
> [170321.960898] Received IB TimeWait exit for cm_id ffff88046c72da00.
> [170321.960993] Session 0x00000000000000000002c9030005566e: kernel thread ib_srpt_compl (PID 9208) stopped
> [170564.275488] INFO: task tcm_fabric:14473 blocked for more than 120 seconds.
> [170564.275491]       Not tainted 3.13.0-gentoo-r1 #1
> [170564.275492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [170564.275493] tcm_fabric      D ffff88047fdd2d00     0 14473  14444 0x00000004
> [170564.275496]  ffff880372806790 0000000000000002 ffff88046f2e9810 ffff880406b5449f
> [170564.275499]  0000000000012d00 ffff880395b35fd8 0000000000012d00 ffff880372806790
> [170564.275501]  ffff880406b54110 ffff880312b94420 ffff880312b94428 7fffffffffffffff
> [170564.275503] Call Trace:
> [170564.275510]  [<ffffffff817335ba>] ? schedule_timeout+0x17a/0x1e0
> [170564.275514]  [<ffffffff810923a3>] ? enqueue_task_fair+0x1b3/0xa90
> [170564.275516]  [<ffffffff8173507d>] ? wait_for_completion+0x9d/0x110
> [170564.275519]  [<ffffffff8108c790>] ? try_to_wake_up+0x280/0x280
> [170564.275527]  [<ffffffffa01129f6>] ? transport_clear_lun_ref+0x46/0x70 [target_core_mod]
> [170564.275532]  [<ffffffffa010d687>] ? core_tpg_post_dellun+0x27/0x60 [target_core_mod]
> [170564.275537]  [<ffffffffa00ffc65>] ? core_dev_del_lun+0x35/0xb0 [target_core_mod]
> [170564.275542]  [<ffffffffa01018e3>] ? target_fabric_port_unlink+0x43/0x60 [target_core_mod]
> [170564.275545]  [<ffffffff811c870e>] ? configfs_unlink+0xee/0x1c0
> [170564.275549]  [<ffffffff8116331a>] ? vfs_unlink+0xda/0x160
> [170564.275551]  [<ffffffff811635ce>] ? do_unlinkat+0x22e/0x260
> [170564.275554]  [<ffffffff810105d5>] ? syscall_trace_enter+0x115/0x1c0
> [170564.275557]  [<ffffffff81737ae1>] ? tracesys+0xd4/0xd9

Thanks for reporting.

Starting with v3.13, this particular logic was changed to use percpu
refcounting.  I'm able to reproduce a similar issue with a different
fabric driver, and am currently tracking the bug down.
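
For anyone following along, here is a rough userspace sketch of the
general kill-then-wait refcount pattern, to show why the trace above
can sit in wait_for_completion() under transport_clear_lun_ref()
forever.  This is only a model, not the target_core_mod code: the
struct and every model_*() name are hypothetical, and it assumes (not
yet confirmed) that the hang comes from a reference that is never
dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; NOT the actual target_core_mod code. */
struct lun_ref_model {
	atomic_long count;	/* 1 initial ref + 1 per in-flight command */
	atomic_bool released;	/* stands in for the completion the unlink path waits on */
};

static void model_init(struct lun_ref_model *r)
{
	atomic_init(&r->count, 1);
	atomic_init(&r->released, false);
}

/* A command takes a reference while it is in flight. */
static void model_get(struct lun_ref_model *r)
{
	atomic_fetch_add(&r->count, 1);
}

/* Dropping the last reference is what would wake the waiter. */
static void model_put(struct lun_ref_model *r)
{
	long old = atomic_fetch_sub(&r->count, 1);

	assert(old > 0);	/* catch unbalanced puts in the model */
	if (old == 1)
		atomic_store(&r->released, true);
}

/* Drop the initial reference, in the spirit of percpu_ref_kill(). */
static void model_kill(struct lun_ref_model *r)
{
	model_put(r);
}

/* True if a wait on the completion would return instead of blocking. */
static bool model_wait_would_return(struct lun_ref_model *r)
{
	return atomic_load(&r->released);
}
```

In this model, a model_get() without a matching model_put() leaves
model_wait_would_return() false after model_kill(), which is the
userspace analogue of the unlink path blocking indefinitely.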

AFAICT this was a v3.13-specific regression, but given your comment
above it sounds like there is also an issue in v3.10.y code (at least
for SRP).

Btw, it would be helpful to see a dmesg log from v3.10.y for this bug
as well.

Thanks,

--nab
