Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3643! (or :3319)

Hi Nicholas,

On 31.07.2017 at 02:00, Nicholas A. Bellinger wrote:
Hi Lukasz,

On Thu, 2017-07-27 at 17:23 +0200, Łukasz Engel wrote:
Hello

I am experiencing problems using a LIO FC target with VMware ESX (6.5).

This is the same problem as discussed in
https://www.spinics.net/lists/target-devel/msg15670.html
I've tried 4.11.x (up to .12) and 4.12.3 (and also some kernels from the
4.9 line).
[...]
Yes, this is most certainly the same bug the patch here addresses:

https://www.spinics.net/lists/target-devel/msg15759.html

Also, a smaller version of that just went out with your CC on it here:

qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=6bcbb3174caa5f1ccc894f8ae077631659d5a629
After applying both patches (for 6bcbb3174caa5f1ccc894f8ae077631659d5a629, manual adjustment of the second chunk was required on 4.12.x, since the surrounding context differs slightly), the BUG no longer triggers, but instead there is a hang in a kworker, with all I/O from FC frozen:

Jul 31 16:27:14 teststorage [   42.897291] qla2xxx [0000:06:00.0]-500a:12: LOOP UP detected (4 Gbps).
Jul 31 16:27:16 teststorage [   42.959068] qla2xxx [0000:09:00.0]-ffff:11: qla24xx_do_nack_work create sess success ffff881f80e29200
Jul 31 16:27:16 teststorage [   44.144811] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80e28400
Jul 31 16:27:16 teststorage [   44.230770] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f575a00
Jul 31 16:27:16 teststorage [   44.473650] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f575600
Jul 31 16:27:16 teststorage [   44.520821] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f574400
Jul 31 16:27:16 teststorage [   44.538183] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f900f3400
Jul 31 16:27:16 teststorage [   44.984052] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f577a00
Jul 31 16:28:54 teststorage [   45.042585] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80e28000
Jul 31 16:28:54 teststorage [  142.431045] ABORT_TASK: Found referenced qla2xxx task_tag: 1189328
Jul 31 16:28:54 teststorage [  142.431059] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1189328
Jul 31 16:28:57 teststorage [  145.632571] ABORT_TASK: Found referenced qla2xxx task_tag: 1177932
Jul 31 16:28:57 teststorage [  145.897299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1177932
Jul 31 16:29:00 teststorage [  148.388569] ABORT_TASK: Found referenced qla2xxx task_tag: 1164160
Jul 31 16:32:00 teststorage [  328.848344] se_cmd waiting CDB: 0x89 tag: 1164160 t_state: 5 se_cmd_flags: 0x01c80109 transport_state: 0x00000033
Jul 31 16:32:41 teststorage [  369.808484] INFO: task kworker/1:93:2802 blocked for more than 120 seconds.
Jul 31 16:32:41 teststorage [  369.808493]       Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [  369.808496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 16:32:41 teststorage [  369.808499] kworker/1:93 D 0 2802 2 0x00000080
Jul 31 16:32:41 teststorage [  369.808515] Workqueue: events qlt_free_session_done [qla2xxx]
Jul 31 16:32:41 teststorage [  369.808518] Call Trace:
Jul 31 16:32:41 teststorage [  369.808524]  __schedule+0x285/0x840
Jul 31 16:32:41 teststorage [  369.808528]  schedule+0x36/0x80
Jul 31 16:32:41 teststorage [  369.808531] schedule_timeout+0x242/0x2f0
Jul 31 16:32:41 teststorage [  369.808537]  ? ttwu_do_activate+0x77/0x80
Jul 31 16:32:41 teststorage [  369.808540] wait_for_completion+0x111/0x170
Jul 31 16:32:41 teststorage [  369.808544]  ? wake_up_q+0x80/0x80
Jul 31 16:32:41 teststorage [  369.808555]  target_wait_for_sess_cmds+0x49/0x170 [target_core_mod]
Jul 31 16:32:41 teststorage [  369.808560]  tcm_qla2xxx_free_session+0x45/0x90 [tcm_qla2xxx]
Jul 31 16:32:41 teststorage [  369.808572]  qlt_free_session_done+0x109/0x5a0 [qla2xxx]
Jul 31 16:32:41 teststorage [  369.808579]  ? qlt_unreg_sess+0xe7/0xf0 [qla2xxx]
Jul 31 16:32:41 teststorage [  369.808583]  ? tcm_qla2xxx_put_sess+0x3c/0x50 [tcm_qla2xxx]
Jul 31 16:32:41 teststorage [  369.808587] process_one_work+0x144/0x370
Jul 31 16:32:41 teststorage [  369.808591]  worker_thread+0x4d/0x3c0
Jul 31 16:32:41 teststorage [  369.808594]  kthread+0x109/0x140
Jul 31 16:32:41 teststorage [  369.808597]  ? rescuer_thread+0x360/0x360
Jul 31 16:32:41 teststorage [  369.808600]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.808720]  ret_from_fork+0x25/0x30
Jul 31 16:32:41 teststorage [  369.809016] NMI backtrace for cpu 3
Jul 31 16:32:41 teststorage [  369.809446] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [  369.809886] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Jul 31 16:32:41 teststorage [  369.810326] Call Trace:
Jul 31 16:32:41 teststorage [  369.810764]  dump_stack+0x63/0x8e
Jul 31 16:32:41 teststorage [  369.811190] nmi_cpu_backtrace+0x9b/0xa0
Jul 31 16:32:41 teststorage [  369.811613]  ? irq_force_complete_move+0xf0/0xf0
Jul 31 16:32:41 teststorage [  369.812036]  nmi_trigger_cpumask_backtrace+0x99/0xd0
Jul 31 16:32:41 teststorage [  369.812461]  arch_trigger_cpumask_backtrace+0x19/0x20
Jul 31 16:32:41 teststorage [  369.812890]  watchdog+0x2c7/0x340
Jul 31 16:32:41 teststorage [  369.813321]  kthread+0x109/0x140
Jul 31 16:32:41 teststorage [ 369.813748] ? reset_hung_task_detector+0x20/0x20
Jul 31 16:32:41 teststorage [  369.814171]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.814595]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.815013]  ret_from_fork+0x25/0x30
Jul 31 16:32:41 teststorage [  369.815428] Sending NMI from CPU 3 to CPUs 0-2,4-11:
Jul 31 16:32:41 teststorage [  369.815846] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.815946] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816045] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816146] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816247] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816352] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816446] NMI backtrace for cpu 7 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816547] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816648] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816747] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816851] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816855] Kernel panic - not syncing: hung_task: blocked tasks
Jul 31 16:32:41 teststorage [  369.816857] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [  369.816857] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Jul 31 16:32:41 teststorage [  369.816857] Call Trace:
Jul 31 16:32:41 teststorage [  369.816859]  dump_stack+0x63/0x8e
Jul 31 16:32:41 teststorage [  369.816861]  panic+0xeb/0x239
Jul 31 16:32:41 teststorage [  369.816863]  watchdog+0x2d5/0x340
Jul 31 16:32:41 teststorage [  369.816864]  kthread+0x109/0x140
Jul 31 16:32:41 teststorage [ 369.816864] ? reset_hung_task_detector+0x20/0x20
Jul 31 16:32:41 teststorage [  369.816865]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.816866]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.816866]  ret_from_fork+0x25/0x30

8<=====
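For readers following the trace: the hang sits in target_wait_for_sess_cmds(), which blocks session teardown until every outstanding se_cmd drops its last reference. A tiny userspace analogue (Python; all names here are illustrative stand-ins, not kernel APIs) of how one leaked reference on an aborted command stalls the teardown worker:

```python
import threading

class Cmd:
    """Illustrative stand-in for an se_cmd with a refcount and completion."""
    def __init__(self):
        self.refs = 2                  # e.g. fabric ref + target-core ref
        self.done = threading.Event()  # rough analogue of cmd_wait_comp
        self.lock = threading.Lock()

    def put(self):
        """Drop one reference; the final put signals the completion."""
        with self.lock:
            self.refs -= 1
            if self.refs == 0:
                self.done.set()

def wait_for_sess_cmds(cmds, timeout):
    """Analogue of target_wait_for_sess_cmds(): wait for every command."""
    return all(c.done.wait(timeout) for c in cmds)

cmd = Cmd()
cmd.put()  # normal completion path drops one reference
# In the buggy TMR ABORT scenario the second put() never happens, so the
# teardown "kworker" would block forever (here we bail out via timeout).
hung = not wait_for_sess_cmds([cmd], timeout=0.2)
print("teardown hung:", hung)  # → teardown hung: True
```

The kernel version has no timeout, which is why the hung-task watchdog fires after 120 seconds instead.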

Also, some more details on what your backend device storage behind
target-core looks like would be helpful.

Physical HBAs, device information, and any MD/RAID/LVM/etc on top..

I have an Adaptec 7805 HW RAID (6x SAS 600G 15k HDDs in RAID0) exporting one device to the system (sdc); LVM sits on this device, and an LV (vgtest/test) is used as the LIO block backstore exported to the ESX hosts.
(The system itself lives on 2 separate SATA HDDs in software RAID1 + LVM.)

[root@TESTSTORAGE ~]# lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdb                   8:16   0 465,8G  0 disk
└─md126               9:126  0 465,8G  0 raid1
  ├─md126p3         259:2    0   314G  0 md
  │ ├─vgsystem-swap 253:1    0    64G  0 lvm
  │ ├─vgsystem-root 253:0    0   100G  0 lvm   /
  │ └─vgsystem-home 253:3    0   200G  0 lvm   /home
  ├─md126p1         259:0    0   200M  0 md    /boot/efi
  ├─md126p4         259:3    0 150,6G  0 md
  │ └─vgsystem-root 253:0    0   100G  0 lvm   /
  └─md126p2         259:1    0     1G  0 md    /boot
sdc                   8:32   1   3,3T  0 disk
└─vgtest-test       253:2    0   3,3T  0 lvm
sda                   8:0    0 465,8G  0 disk
└─md126               9:126  0 465,8G  0 raid1
  ├─md126p3         259:2    0   314G  0 md
  │ ├─vgsystem-swap 253:1    0    64G  0 lvm
  │ ├─vgsystem-root 253:0    0   100G  0 lvm   /
  │ └─vgsystem-home 253:3    0   200G  0 lvm   /home
  ├─md126p1         259:0    0   200M  0 md    /boot/efi
  ├─md126p4         259:3    0 150,6G  0 md
  │ └─vgsystem-root 253:0    0   100G  0 lvm   /
  └─md126p2         259:1    0     1G  0 md    /boot
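For anyone reproducing this layout, the backstore side would typically be created with targetcli along these lines (a sketch only; the LV name is taken from the listing above, while the WWN placeholder is hypothetical and depends on the local qla2xxx HBA port):

```shell
# Create a block backstore from the LV described above
targetcli /backstores/block create name=test dev=/dev/vgtest/test

# Create the qla2xxx target for the local HBA port and map the LUN.
# naa.XXXXXXXXXXXXXXXX is a placeholder; use the port WWN reported
# under /sys/class/fc_host on the actual machine.
targetcli /qla2xxx create naa.XXXXXXXXXXXXXXXX
targetcli /qla2xxx/naa.XXXXXXXXXXXXXXXX/luns create /backstores/block/test
```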

8<=========

--
Lukasz Engel

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


