Hi Pascal & Co, Apologies for the delayed follow up. On Fri, 2017-07-21 at 15:07 +0200, Pascal de Bruijn wrote: <SNIP> > We still haven't been able to reliably reproduce, however we did come > across this, that's very likely related. > > Essentially we're running 'echo check > /sys/block/md0/md/sync_action' > throughout the night, with no practical sync_speed_max limit. > > Only in the morning presumably when I was doing stuff in VMware (I don't > particularly recall what), VMware started reporting the volume as > inaccessible. > > Here's the full dmesg, to provide maximum context (sorry for the long post): > No worries. The more verbosity the better when it comes to stuff like this. ;) Below is the analysis for the last logs. > [Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288 > [Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288 > [Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_dpo attribute > [Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_fua_read attribute > [Thu Jul 20 18:01:51 2017] qla2xxx [0000:81:00.1]-00af:13: Performing ISP error recovery - ha=ffff9ebcd33c0000. > [Thu Jul 20 18:01:53 2017] qla2xxx [0000:81:00.1]-500a:13: LOOP UP detected (4 Gbps). > [Thu Jul 20 18:01:53 2017] qla2xxx [0000:82:00.0]-00af:14: Performing ISP error recovery - ha=ffff9ebcd3180000. > [Thu Jul 20 18:01:55 2017] qla2xxx [0000:82:00.0]-500a:14: LOOP UP detected (4 Gbps). > [Thu Jul 20 18:01:56 2017] qla2xxx [0000:82:00.1]-00af:15: Performing ISP error recovery - ha=ffff9ebcd3400000. > [Thu Jul 20 18:01:57 2017] qla2xxx [0000:82:00.1]-500a:15: LOOP UP detected (4 Gbps). > [Thu Jul 20 18:01:58 2017] qla2xxx [0000:81:00.0]-00af:2: Performing ISP error recovery - ha=ffff9ebcd4190000. > [Thu Jul 20 18:01:59 2017] qla2xxx [0000:81:00.0]-500a:2: LOOP UP detected (4 Gbps). > [Thu Jul 20 18:04:42 2017] md: data-check of RAID array md0 > [Thu Jul 20 18:04:42 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 18:04:42 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 18:04:42 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 18:40:14 2017] md: md0: data-check done. > [Thu Jul 20 18:56:49 2017] md: data-check of RAID array md0 > [Thu Jul 20 18:56:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 18:56:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 18:56:49 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 19:34:16 2017] md: md0: data-check done. > [Thu Jul 20 20:03:49 2017] md: data-check of RAID array md0 > [Thu Jul 20 20:03:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 20:03:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 20:03:49 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 20:04:51 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1223620 > [Thu Jul 20 20:04:52 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1223620 > [Thu Jul 20 20:08:00 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1237348 > [Thu Jul 20 20:08:01 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1237348 Ok, so there are the first two occurrences of outstanding I/O to the backend MD device taking long enough to trigger the ESX host side ABORT_TASK timeouts. Notice the timestamps between 'Found referenced qla2xxx task_tag ...', and 'Sending TMR_FUNCTION_COMPLETE'. This tells us the backend I/O to the md device only took ~1 second to complete back to target-core, once the ABORT_TASK was received. > [Thu Jul 20 20:40:04 2017] md: md0: data-check done. > [Thu Jul 20 21:10:49 2017] md: data-check of RAID array md0 > [Thu Jul 20 21:10:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 21:10:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 21:10:49 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 21:51:29 2017] md: md0: data-check done. > [Thu Jul 20 22:17:49 2017] md: data-check of RAID array md0 > [Thu Jul 20 22:17:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 22:17:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 22:17:49 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 22:53:37 2017] md: md0: data-check done. > [Thu Jul 20 23:24:50 2017] md: data-check of RAID array md0 > [Thu Jul 20 23:24:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 23:24:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 23:24:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 00:04:39 2017] md: md0: data-check done. > [Fri Jul 21 00:31:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 00:31:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 00:31:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 00:31:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 01:09:07 2017] md: md0: data-check done. > [Fri Jul 21 01:38:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 01:38:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 01:38:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 01:38:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 01:39:11 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1161652 > [Fri Jul 21 01:39:12 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1161652 > [Fri Jul 21 02:05:54 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1213252 > [Fri Jul 21 02:05:55 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1213252 > [Fri Jul 21 02:20:30 2017] md: md0: data-check done. > [Fri Jul 21 02:45:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 02:45:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 02:45:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 02:45:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 03:08:14 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1243780 > [Fri Jul 21 03:08:15 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1243780 And the second occurrence. > [Fri Jul 21 03:28:21 2017] md: md0: data-check done. > [Fri Jul 21 03:52:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 03:52:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 03:52:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 03:52:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 03:58:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1162612 > [Fri Jul 21 03:58:34 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1162612 And the third. > [Fri Jul 21 04:33:10 2017] md: md0: data-check done. > [Fri Jul 21 04:59:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 04:59:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 04:59:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 04:59:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 05:40:34 2017] md: md0: data-check done. > [Fri Jul 21 06:06:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 06:06:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 06:06:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 06:06:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 06:46:58 2017] md: md0: data-check done. > [Fri Jul 21 07:13:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 07:13:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 07:13:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 07:13:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 07:59:38 2017] md: md0: data-check done. > [Fri Jul 21 08:20:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 08:20:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 08:20:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 08:20:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 09:04:12 2017] md: md0: data-check done. > [Fri Jul 21 09:27:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 09:27:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 09:27:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 09:27:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 10:09:28 2017] md: md0: data-check done. > [Fri Jul 21 10:34:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 10:34:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 10:34:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 10:34:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 10:46:22 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1247284 > [Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1247284 > [Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1247236 And the forth, and fifth which doesn't locate a tag (eg: the response has already been sent). > [Fri Jul 21 10:46:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1248148 > [Fri Jul 21 10:49:04 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1175812 And finally we reach this point, there target-core receives ABORT_TASKs for two tags (1248148 + 1175812), and waits for the backend md device to complete them.. > [Fri Jul 21 10:50:22 2017] INFO: task kworker/u34:2:4840 blocked for more than 180 seconds. > [Fri Jul 21 10:50:22 2017] Not tainted 4.9.38-ul5 #3 > [Fri Jul 21 10:50:22 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Fri Jul 21 10:50:22 2017] kworker/u34:2 D 0 4840 2 0x00000000 > [Fri Jul 21 10:50:22 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod] > [Fri Jul 21 10:50:22 2017] 0000000000000000 ffff9eb4d5331880 ffff9ebccaccd240 ffff9ebcffcd5080 > [Fri Jul 21 10:50:22 2017] ffff9ebcdc5ca340 ffffb11c07d0fbe8 ffffffffa8092196 0000000000130b94 > [Fri Jul 21 10:50:22 2017] ffff9ebccaccd240 7fffffffffffffff 0000000000000000 ffff9ebccaccd240 > [Fri Jul 21 10:50:22 2017] Call Trace: > [Fri Jul 21 10:50:22 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530 > [Fri Jul 21 10:50:22 2017] [<ffffffffa8092591>] schedule+0x31/0x80 > [Fri Jul 21 10:50:22 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190 > [Fri Jul 21 10:50:22 2017] [<ffffffffa7cb5edd>] ? up+0x2d/0x50 > [Fri Jul 21 10:50:22 2017] [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90 > [Fri Jul 21 10:50:22 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170 > [Fri Jul 21 10:50:22 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70 > [Fri Jul 21 10:50:22 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20 > [Fri Jul 21 10:50:22 2017] [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod] > [Fri Jul 21 10:50:22 2017] [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod] > [Fri Jul 21 10:50:22 2017] [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod] > [Fri Jul 21 10:50:22 2017] [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod] > [Fri Jul 21 10:50:22 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0 > [Fri Jul 21 10:50:22 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0 > [Fri Jul 21 10:50:22 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:50:22 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:50:22 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0 > [Fri Jul 21 10:50:22 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100 > [Fri Jul 21 10:50:22 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30 First hung task for the tmr-iblock process, that is waiting for one of the two tags to complete from md back to target-core. > [Fri Jul 21 10:53:23 2017] INFO: task kworker/13:1:3063 blocked for more than 180 seconds. > [Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3 > [Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Fri Jul 21 10:53:23 2017] kworker/13:1 D 0 3063 2 0x00000000 > [Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx] > [Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9ebccd9c2340 ffff9ebcffd55080 > [Fri Jul 21 10:53:23 2017] ffff9ebcdc5cde00 ffffb11c2736fb28 ffffffffa8092196 ffffffffa7ca4a37 > [Fri Jul 21 10:53:23 2017] ffff9ebccd9c2340 7fffffffffffffff 0000000000000000 ffff9ebccd9c2340 > [Fri Jul 21 10:53:23 2017] Call Trace: > [Fri Jul 21 10:53:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20 > [Fri Jul 21 10:53:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod] > [Fri Jul 21 10:53:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx] > [Fri Jul 21 10:53:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx] > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30 This is a hung task warning for the process responsible for doing qla2xxx FC session reinstatement. It's blocked (as expected) waiting for the backend I/O to complete for the ABORT_TASK in the tmr-iblock process. <SNIP> > [Fri Jul 21 10:53:23 2017] INFO: task kworker/u33:0:3806 blocked for more than 180 seconds. > [Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3 > [Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Fri Jul 21 10:53:23 2017] kworker/u33:0 D 0 3806 2 0x00000000 > [Fri Jul 21 10:53:23 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod] > [Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9eb4d5331880 ffff9eb468909780 ffff9eb4dfd15080 > [Fri Jul 21 10:53:23 2017] ffff9eb4dc68e9c0 ffffb11c26d47be8 ffffffffa8092196 000000000011f104 > [Fri Jul 21 10:53:23 2017] ffff9eb468909780 7fffffffffffffff 0000000000000000 ffff9eb468909780 > [Fri Jul 21 10:53:23 2017] Call Trace: > [Fri Jul 21 10:53:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7cb5edd>] ? up+0x2d/0x50 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20 > [Fri Jul 21 10:53:23 2017] [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod] > [Fri Jul 21 10:53:23 2017] [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod] > [Fri Jul 21 10:53:23 2017] [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod] > [Fri Jul 21 10:53:23 2017] [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod] > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0 > [Fri Jul 21 10:53:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100 > [Fri Jul 21 10:53:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30 Second hung task warning for tmr-iblock progress, for the second outstanding I/O to the md backend device, waiting for completion. <SNIP> > [Fri Jul 21 10:56:23 2017] INFO: task kworker/2:1:20395 blocked for more than 180 seconds. > [Fri Jul 21 10:56:23 2017] Not tainted 4.9.38-ul5 #3 > [Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Fri Jul 21 10:56:23 2017] kworker/2:1 D 0 20395 2 0x00000000 > [Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx] > [Fri Jul 21 10:56:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9eb468940bc0 ffff9eb4dfc95080 > [Fri Jul 21 10:56:23 2017] ffff9eb4dc689780 ffffb11c077e7b28 ffffffffa8092196 ffffffffa7ca4a37 > [Fri Jul 21 10:56:23 2017] ffff9eb468940bc0 7fffffffffffffff 0000000000000000 ffff9eb468940bc0 > [Fri Jul 21 10:56:23 2017] Call Trace: > [Fri Jul 21 10:56:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70 > [Fri Jul 21 10:56:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80 > [Fri Jul 21 10:56:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80 > [Fri Jul 21 10:56:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70 > [Fri Jul 21 10:56:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20 > [Fri Jul 21 10:56:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod] > [Fri Jul 21 10:56:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx] > [Fri Jul 21 10:56:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx] > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0 > [Fri Jul 21 10:56:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100 > [Fri Jul 21 10:56:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30 This is a hung task warning for (another) process responsible for doing qla2xxx FC session reinstatement. As before, it's blocked (as expected) waiting for the backend I/O to complete for the ABORT_TASK in the tmr-iblock process. > I hope this helps, let me know if you need anything from our end. > According to the hung task warnings for the two tmr-iblock processes, it would mean target-core is never getting I/O responses back from the backend driver. That is, there is never a 'Sending TMR_FUNCTION_COMPLETE for ref_tag' message for the two tags (1247284 + 1175812) undergoing ABORT_TASK, which would tend to indicate the md backend device is never completing I/O back to target-core for some reason.. To verify this is really the case, here is one more debug patch for you so we can see the se_cmd state once the hung task occurs. Please re-run this along with the original patches. diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 077344c..1263612 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -2791,7 +2791,13 @@ void transport_clear_lun_ref(struct se_lun *lun) spin_unlock_irqrestore(&cmd->t_state_lock, *flags); - wait_for_completion(&cmd->t_transport_stop_comp); + while (!wait_for_completion_timeout(&cmd->t_transport_stop_comp, + 180 * HZ)) { + pr_warn_ratelimited("se_cmd waiting CDB: 0x%02x tag: %llu t_state: %d" + "se_cmd_flags: 0x%08x transport_state: 0x%08x\n", + cmd->t_task_cdb[0], cmd->tag, cmd->t_state, + cmd->se_cmd_flags, cmd->transport_state); + } spin_lock_irqsave(&cmd->t_state_lock, *flags); cmd->transport_state &= ~(CMD_T_ACTIVE | CMD_T_STOP); Also, after thinking more about the earlier report of the 'D' state processes, my guess is you're hitting the same type of problem where md is presumably not completing I/Os back to target-core. So all that said, I think it's safe to push part of original qla2xxx patch to address the original BUG_ON() you reported, as AFICT that is separate from whatever else is going on here with md. I'll send out an smaller version of that patch shortly. -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html