Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Pascal & Co,

Apologies for the delayed follow up.

On Fri, 2017-07-21 at 15:07 +0200, Pascal de Bruijn wrote:

<SNIP>
	
> We still haven't been able to reliably reproduce, however we did come 
> across this, that's very likely related.
> 
> Essentially we're running 'echo check > /sys/block/md0/md/sync_action' 
> throughout the night, with no practical sync_speed_max limit.
> 
> Only in the morning presumably when I was doing stuff in VMware (I don't 
> particularly recall what), VMware started reporting the volume as 
> inaccessible.
> 
> Here's the full dmesg, to provide maximum context (sorry for the long post):
> 

No worries.  The more verbosity the better when it comes to stuff like
this.  ;)

Below is the analysis for the last logs.

> [Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288
> [Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288
> [Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_dpo attribute
> [Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_fua_read attribute
> [Thu Jul 20 18:01:51 2017] qla2xxx [0000:81:00.1]-00af:13: Performing ISP error recovery - ha=ffff9ebcd33c0000.
> [Thu Jul 20 18:01:53 2017] qla2xxx [0000:81:00.1]-500a:13: LOOP UP detected (4 Gbps).
> [Thu Jul 20 18:01:53 2017] qla2xxx [0000:82:00.0]-00af:14: Performing ISP error recovery - ha=ffff9ebcd3180000.
> [Thu Jul 20 18:01:55 2017] qla2xxx [0000:82:00.0]-500a:14: LOOP UP detected (4 Gbps).
> [Thu Jul 20 18:01:56 2017] qla2xxx [0000:82:00.1]-00af:15: Performing ISP error recovery - ha=ffff9ebcd3400000.
> [Thu Jul 20 18:01:57 2017] qla2xxx [0000:82:00.1]-500a:15: LOOP UP detected (4 Gbps).
> [Thu Jul 20 18:01:58 2017] qla2xxx [0000:81:00.0]-00af:2: Performing ISP error recovery - ha=ffff9ebcd4190000.
> [Thu Jul 20 18:01:59 2017] qla2xxx [0000:81:00.0]-500a:2: LOOP UP detected (4 Gbps).
> [Thu Jul 20 18:04:42 2017] md: data-check of RAID array md0
> [Thu Jul 20 18:04:42 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Thu Jul 20 18:04:42 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Thu Jul 20 18:04:42 2017] md: using 128k window, over a total of 405798912k.
> [Thu Jul 20 18:40:14 2017] md: md0: data-check done.
> [Thu Jul 20 18:56:49 2017] md: data-check of RAID array md0
> [Thu Jul 20 18:56:49 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Thu Jul 20 18:56:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Thu Jul 20 18:56:49 2017] md: using 128k window, over a total of 405798912k.
> [Thu Jul 20 19:34:16 2017] md: md0: data-check done.
> [Thu Jul 20 20:03:49 2017] md: data-check of RAID array md0
> [Thu Jul 20 20:03:49 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Thu Jul 20 20:03:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Thu Jul 20 20:03:49 2017] md: using 128k window, over a total of 405798912k.
> [Thu Jul 20 20:04:51 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1223620
> [Thu Jul 20 20:04:52 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1223620
> [Thu Jul 20 20:08:00 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1237348
> [Thu Jul 20 20:08:01 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1237348

Ok, so there are the first two occurrences of outstanding I/O to the
backend MD device taking long enough to trigger the ESX host side
ABORT_TASK timeouts.

Notice the timestamps between 'Found referenced qla2xxx task_tag ...',
and 'Sending TMR_FUNCTION_COMPLETE'.

This tells us the backend I/O to the md device only took ~1 second to
complete back to target-core, once the ABORT_TASK was received.

> [Thu Jul 20 20:40:04 2017] md: md0: data-check done.
> [Thu Jul 20 21:10:49 2017] md: data-check of RAID array md0
> [Thu Jul 20 21:10:49 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Thu Jul 20 21:10:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Thu Jul 20 21:10:49 2017] md: using 128k window, over a total of 405798912k.
> [Thu Jul 20 21:51:29 2017] md: md0: data-check done.
> [Thu Jul 20 22:17:49 2017] md: data-check of RAID array md0
> [Thu Jul 20 22:17:49 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Thu Jul 20 22:17:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Thu Jul 20 22:17:49 2017] md: using 128k window, over a total of 405798912k.
> [Thu Jul 20 22:53:37 2017] md: md0: data-check done.
> [Thu Jul 20 23:24:50 2017] md: data-check of RAID array md0
> [Thu Jul 20 23:24:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Thu Jul 20 23:24:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Thu Jul 20 23:24:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 00:04:39 2017] md: md0: data-check done.
> [Fri Jul 21 00:31:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 00:31:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 00:31:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 00:31:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 01:09:07 2017] md: md0: data-check done.
> [Fri Jul 21 01:38:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 01:38:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 01:38:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 01:38:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 01:39:11 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1161652
> [Fri Jul 21 01:39:12 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1161652
> [Fri Jul 21 02:05:54 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1213252
> [Fri Jul 21 02:05:55 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1213252
> [Fri Jul 21 02:20:30 2017] md: md0: data-check done.
> [Fri Jul 21 02:45:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 02:45:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 02:45:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 02:45:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 03:08:14 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1243780
> [Fri Jul 21 03:08:15 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1243780

And the second occurrence.

> [Fri Jul 21 03:28:21 2017] md: md0: data-check done.
> [Fri Jul 21 03:52:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 03:52:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 03:52:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 03:52:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 03:58:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1162612
> [Fri Jul 21 03:58:34 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1162612

And the third.

> [Fri Jul 21 04:33:10 2017] md: md0: data-check done.
> [Fri Jul 21 04:59:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 04:59:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 04:59:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 04:59:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 05:40:34 2017] md: md0: data-check done.
> [Fri Jul 21 06:06:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 06:06:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 06:06:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 06:06:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 06:46:58 2017] md: md0: data-check done.
> [Fri Jul 21 07:13:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 07:13:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 07:13:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 07:13:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 07:59:38 2017] md: md0: data-check done.
> [Fri Jul 21 08:20:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 08:20:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 08:20:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 08:20:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 09:04:12 2017] md: md0: data-check done.
> [Fri Jul 21 09:27:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 09:27:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 09:27:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 09:27:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 10:09:28 2017] md: md0: data-check done.
> [Fri Jul 21 10:34:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 10:34:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
> [Fri Jul 21 10:34:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 10:34:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 10:46:22 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1247284
> [Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1247284
> [Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1247236

And the forth, and fifth which doesn't locate a tag (eg: the response
has already been sent).

> [Fri Jul 21 10:46:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1248148
> [Fri Jul 21 10:49:04 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1175812

And finally we reach this point, there target-core receives ABORT_TASKs
for two tags (1248148 + 1175812), and waits for the backend md device to
complete them..

> [Fri Jul 21 10:50:22 2017] INFO: task kworker/u34:2:4840 blocked for more than 180 seconds.
> [Fri Jul 21 10:50:22 2017]       Not tainted 4.9.38-ul5 #3
> [Fri Jul 21 10:50:22 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Fri Jul 21 10:50:22 2017] kworker/u34:2   D    0  4840      2 0x00000000
> [Fri Jul 21 10:50:22 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
> [Fri Jul 21 10:50:22 2017]  0000000000000000 ffff9eb4d5331880 ffff9ebccaccd240 ffff9ebcffcd5080
> [Fri Jul 21 10:50:22 2017]  ffff9ebcdc5ca340 ffffb11c07d0fbe8 ffffffffa8092196 0000000000130b94
> [Fri Jul 21 10:50:22 2017]  ffff9ebccaccd240 7fffffffffffffff 0000000000000000 ffff9ebccaccd240
> [Fri Jul 21 10:50:22 2017] Call Trace:
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7cb5edd>] ? up+0x2d/0x50
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
> [Fri Jul 21 10:50:22 2017]  [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
> [Fri Jul 21 10:50:22 2017]  [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
> [Fri Jul 21 10:50:22 2017]  [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod]
> [Fri Jul 21 10:50:22 2017]  [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod]
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
> [Fri Jul 21 10:50:22 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30

First hung task for the tmr-iblock process, that is waiting for one of
the two tags to complete from md back to target-core.

> [Fri Jul 21 10:53:23 2017] INFO: task kworker/13:1:3063 blocked for more than 180 seconds.
> [Fri Jul 21 10:53:23 2017]       Not tainted 4.9.38-ul5 #3
> [Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Fri Jul 21 10:53:23 2017] kworker/13:1    D    0  3063      2 0x00000000
> [Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
> [Fri Jul 21 10:53:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9ebccd9c2340 ffff9ebcffd55080
> [Fri Jul 21 10:53:23 2017]  ffff9ebcdc5cde00 ffffb11c2736fb28 ffffffffa8092196 ffffffffa7ca4a37
> [Fri Jul 21 10:53:23 2017]  ffff9ebccd9c2340 7fffffffffffffff 0000000000000000 ffff9ebccd9c2340
> [Fri Jul 21 10:53:23 2017] Call Trace:
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
> [Fri Jul 21 10:53:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
> [Fri Jul 21 10:53:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
> [Fri Jul 21 10:53:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30

This is a hung task warning for the process responsible for doing
qla2xxx FC session reinstatement.

It's blocked (as expected) waiting for the backend I/O to complete for
the ABORT_TASK in the tmr-iblock process.

<SNIP>

> [Fri Jul 21 10:53:23 2017] INFO: task kworker/u33:0:3806 blocked for more than 180 seconds.
> [Fri Jul 21 10:53:23 2017]       Not tainted 4.9.38-ul5 #3
> [Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Fri Jul 21 10:53:23 2017] kworker/u33:0   D    0  3806      2 0x00000000
> [Fri Jul 21 10:53:23 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
> [Fri Jul 21 10:53:23 2017]  0000000000000000 ffff9eb4d5331880 ffff9eb468909780 ffff9eb4dfd15080
> [Fri Jul 21 10:53:23 2017]  ffff9eb4dc68e9c0 ffffb11c26d47be8 ffffffffa8092196 000000000011f104
> [Fri Jul 21 10:53:23 2017]  ffff9eb468909780 7fffffffffffffff 0000000000000000 ffff9eb468909780
> [Fri Jul 21 10:53:23 2017] Call Trace:
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7cb5edd>] ? up+0x2d/0x50
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
> [Fri Jul 21 10:53:23 2017]  [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
> [Fri Jul 21 10:53:23 2017]  [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
> [Fri Jul 21 10:53:23 2017]  [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod]
> [Fri Jul 21 10:53:23 2017]  [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod]
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
> [Fri Jul 21 10:53:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30

Second hung task warning for tmr-iblock progress, for the second
outstanding I/O to the md backend device, waiting for completion.

<SNIP>

> [Fri Jul 21 10:56:23 2017] INFO: task kworker/2:1:20395 blocked for more than 180 seconds.
> [Fri Jul 21 10:56:23 2017]       Not tainted 4.9.38-ul5 #3
> [Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Fri Jul 21 10:56:23 2017] kworker/2:1     D    0 20395      2 0x00000000
> [Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
> [Fri Jul 21 10:56:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9eb468940bc0 ffff9eb4dfc95080
> [Fri Jul 21 10:56:23 2017]  ffff9eb4dc689780 ffffb11c077e7b28 ffffffffa8092196 ffffffffa7ca4a37
> [Fri Jul 21 10:56:23 2017]  ffff9eb468940bc0 7fffffffffffffff 0000000000000000 ffff9eb468940bc0
> [Fri Jul 21 10:56:23 2017] Call Trace:
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
> [Fri Jul 21 10:56:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
> [Fri Jul 21 10:56:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
> [Fri Jul 21 10:56:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30

This is a hung task warning for (another) process responsible for doing
qla2xxx FC session reinstatement.

As before, it's blocked (as expected) waiting for the backend I/O to
complete for the ABORT_TASK in the tmr-iblock process.

> I hope this helps, let me know if you need anything from our end.
> 

According to the hung task warnings for the two tmr-iblock processes, it
would mean target-core is never getting I/O responses back from the
backend driver.

That is, there is never a 'Sending TMR_FUNCTION_COMPLETE for ref_tag'
message for the two tags (1247284 + 1175812) undergoing ABORT_TASK,
which would tend to indicate the md backend device is never completing
I/O back to target-core for some reason..

To verify this is really the case, here is one more debug patch for you
so we can see the se_cmd state once the hung task occurs.  Please re-run
this along with the original patches.

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 077344c..1263612 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2791,7 +2791,13 @@ void transport_clear_lun_ref(struct se_lun *lun)
 
        spin_unlock_irqrestore(&cmd->t_state_lock, *flags);
 
-       wait_for_completion(&cmd->t_transport_stop_comp);
+       while (!wait_for_completion_timeout(&cmd->t_transport_stop_comp,
+                                           180 * HZ)) {
+               pr_warn_ratelimited("se_cmd waiting CDB: 0x%02x tag: %llu t_state: %d"
+                       "se_cmd_flags: 0x%08x transport_state: 0x%08x\n",
+                       cmd->t_task_cdb[0], cmd->tag, cmd->t_state,
+                       cmd->se_cmd_flags, cmd->transport_state);
+       }
 
        spin_lock_irqsave(&cmd->t_state_lock, *flags);
        cmd->transport_state &= ~(CMD_T_ACTIVE | CMD_T_STOP);

Also, after thinking more about the earlier report of the 'D' state
processes, my guess is you're hitting the same type of problem where md
is presumably not completing I/Os back to target-core.

So all that said, I think it's safe to push part of original qla2xxx
patch to address the original BUG_ON() you reported, as AFICT that is
separate from whatever else is going on here with md.

I'll send out an smaller version of that patch shortly.


--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux