Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3643! (or :3319)

Hi Nicholas,

On 31.07.2017 at 02:00, Nicholas A. Bellinger wrote:
Hi Lukasz,

On Thu, 2017-07-27 at 17:23 +0200, Łukasz Engel wrote:
Hello

I am experiencing problems using a LIO FC target with VMware ESX (6.5).

This is the same problem as discussed in
https://www.spinics.net/lists/target-devel/msg15670.html
I've tried 4.11.x (up to .12) and 4.12.3 (and also some kernels from the
4.9 line).
[...]
Yes, this is most certainly the same bug the patch here addresses:

https://www.spinics.net/lists/target-devel/msg15759.html

Also, a smaller version of that just went out with your CC on it here:

qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=6bcbb3174caa5f1ccc894f8ae077631659d5a629
After applying both patches (for 6bcbb3174caa5f1ccc894f8ae077631659d5a629, manual adjustment of the second chunk was required on 4.12.x, since the surrounding context differs slightly), the BUG no longer triggers, but instead there is a hang in a kworker, with all I/O from FC frozen:

Jul 31 16:27:14 teststorage [   42.897291] qla2xxx [0000:06:00.0]-500a:12: LOOP UP detected (4 Gbps).
Jul 31 16:27:16 teststorage [   42.959068] qla2xxx [0000:09:00.0]-ffff:11: qla24xx_do_nack_work create sess success ffff881f80e29200
Jul 31 16:27:16 teststorage [   44.144811] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80e28400
Jul 31 16:27:16 teststorage [   44.230770] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f575a00
Jul 31 16:27:16 teststorage [   44.473650] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f575600
Jul 31 16:27:16 teststorage [   44.520821] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f574400
Jul 31 16:27:16 teststorage [   44.538183] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f900f3400
Jul 31 16:27:16 teststorage [   44.984052] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f577a00
Jul 31 16:28:54 teststorage [   45.042585] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80e28000
Jul 31 16:28:54 teststorage [  142.431045] ABORT_TASK: Found referenced qla2xxx task_tag: 1189328
Jul 31 16:28:54 teststorage [  142.431059] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1189328
Jul 31 16:28:57 teststorage [  145.632571] ABORT_TASK: Found referenced qla2xxx task_tag: 1177932
Jul 31 16:28:57 teststorage [  145.897299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1177932
Jul 31 16:29:00 teststorage [  148.388569] ABORT_TASK: Found referenced qla2xxx task_tag: 1164160
Jul 31 16:32:00 teststorage [  328.848344] se_cmd waiting CDB: 0x89 tag: 1164160 t_state: 5 se_cmd_flags: 0x01c80109 transport_state: 0x00000033
Jul 31 16:32:41 teststorage [  369.808484] INFO: task kworker/1:93:2802 blocked for more than 120 seconds.
Jul 31 16:32:41 teststorage [  369.808493]       Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [  369.808496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 16:32:41 teststorage [  369.808499] kworker/1:93 D 0 2802 2 0x00000080
Jul 31 16:32:41 teststorage [  369.808515] Workqueue: events qlt_free_session_done [qla2xxx]
Jul 31 16:32:41 teststorage [  369.808518] Call Trace:
Jul 31 16:32:41 teststorage [  369.808524]  __schedule+0x285/0x840
Jul 31 16:32:41 teststorage [  369.808528]  schedule+0x36/0x80
Jul 31 16:32:41 teststorage [  369.808531] schedule_timeout+0x242/0x2f0
Jul 31 16:32:41 teststorage [  369.808537]  ? ttwu_do_activate+0x77/0x80
Jul 31 16:32:41 teststorage [  369.808540] wait_for_completion+0x111/0x170
Jul 31 16:32:41 teststorage [  369.808544]  ? wake_up_q+0x80/0x80
Jul 31 16:32:41 teststorage [  369.808555]  target_wait_for_sess_cmds+0x49/0x170 [target_core_mod]
Jul 31 16:32:41 teststorage [  369.808560]  tcm_qla2xxx_free_session+0x45/0x90 [tcm_qla2xxx]
Jul 31 16:32:41 teststorage [  369.808572]  qlt_free_session_done+0x109/0x5a0 [qla2xxx]
Jul 31 16:32:41 teststorage [  369.808579]  ? qlt_unreg_sess+0xe7/0xf0 [qla2xxx]
Jul 31 16:32:41 teststorage [  369.808583]  ? tcm_qla2xxx_put_sess+0x3c/0x50 [tcm_qla2xxx]
Jul 31 16:32:41 teststorage [  369.808587] process_one_work+0x144/0x370
Jul 31 16:32:41 teststorage [  369.808591]  worker_thread+0x4d/0x3c0
Jul 31 16:32:41 teststorage [  369.808594]  kthread+0x109/0x140
Jul 31 16:32:41 teststorage [  369.808597]  ? rescuer_thread+0x360/0x360
Jul 31 16:32:41 teststorage [  369.808600]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.808720]  ret_from_fork+0x25/0x30
Jul 31 16:32:41 teststorage [  369.809016] NMI backtrace for cpu 3
Jul 31 16:32:41 teststorage [  369.809446] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [  369.809886] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Jul 31 16:32:41 teststorage [  369.810326] Call Trace:
Jul 31 16:32:41 teststorage [  369.810764]  dump_stack+0x63/0x8e
Jul 31 16:32:41 teststorage [  369.811190] nmi_cpu_backtrace+0x9b/0xa0
Jul 31 16:32:41 teststorage [  369.811613]  ? irq_force_complete_move+0xf0/0xf0
Jul 31 16:32:41 teststorage [  369.812036]  nmi_trigger_cpumask_backtrace+0x99/0xd0
Jul 31 16:32:41 teststorage [  369.812461]  arch_trigger_cpumask_backtrace+0x19/0x20
Jul 31 16:32:41 teststorage [  369.812890]  watchdog+0x2c7/0x340
Jul 31 16:32:41 teststorage [  369.813321]  kthread+0x109/0x140
Jul 31 16:32:41 teststorage [ 369.813748] ? reset_hung_task_detector+0x20/0x20
Jul 31 16:32:41 teststorage [  369.814171]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.814595]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.815013]  ret_from_fork+0x25/0x30
Jul 31 16:32:41 teststorage [  369.815428] Sending NMI from CPU 3 to CPUs 0-2,4-11:
Jul 31 16:32:41 teststorage [  369.815846] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.815946] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816045] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816146] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816247] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816352] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816446] NMI backtrace for cpu 7 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816547] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816648] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816747] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816851] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [  369.816855] Kernel panic - not syncing: hung_task: blocked tasks
Jul 31 16:32:41 teststorage [  369.816857] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [  369.816857] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Jul 31 16:32:41 teststorage [  369.816857] Call Trace:
Jul 31 16:32:41 teststorage [  369.816859]  dump_stack+0x63/0x8e
Jul 31 16:32:41 teststorage [  369.816861]  panic+0xeb/0x239
Jul 31 16:32:41 teststorage [  369.816863]  watchdog+0x2d5/0x340
Jul 31 16:32:41 teststorage [  369.816864]  kthread+0x109/0x140
Jul 31 16:32:41 teststorage [ 369.816864] ? reset_hung_task_detector+0x20/0x20
Jul 31 16:32:41 teststorage [  369.816865]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.816866]  ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [  369.816866]  ret_from_fork+0x25/0x30

8<=====
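For readers following the trace: the hang sits in target_wait_for_sess_cmds(), which blocks session teardown until every outstanding se_cmd drops its last reference. A tiny userspace analogue (Python; all names here are illustrative stand-ins, not kernel APIs) of how one leaked reference on an aborted command stalls the teardown worker:

```python
import threading

class Cmd:
    """Illustrative stand-in for an se_cmd with a refcount and completion."""
    def __init__(self):
        self.refs = 2                  # e.g. fabric ref + target-core ref
        self.done = threading.Event()  # rough analogue of cmd_wait_comp
        self.lock = threading.Lock()

    def put(self):
        """Drop one reference; the final put signals the completion."""
        with self.lock:
            self.refs -= 1
            if self.refs == 0:
                self.done.set()

def wait_for_sess_cmds(cmds, timeout):
    """Analogue of target_wait_for_sess_cmds(): wait for every command."""
    return all(c.done.wait(timeout) for c in cmds)

cmd = Cmd()
cmd.put()  # normal completion path drops one reference
# In the buggy TMR ABORT scenario the second put() never happens, so the
# teardown "kworker" would block forever (here we bail out via timeout).
hung = not wait_for_sess_cmds([cmd], timeout=0.2)
print("teardown hung:", hung)  # → teardown hung: True
```

The kernel version has no timeout, which is why the hung-task watchdog fires after 120 seconds instead.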

Also, some more details on what your backend device storage behind
target-core looks like would be helpful.

Physical HBAs, device information, and any MD/RAID/LVM/etc on top..

I have an Adaptec 7805 HW RAID (6x SAS 600G 15k HDDs in RAID0) exporting one device to the system (sdc); LVM sits on this device, and an LV (vgtest/test) is used as the LIO block backstore exported to the ESX hosts.
(The system itself lives on 2 separate SATA HDDs in software RAID1 + LVM.)

[root@TESTSTORAGE ~]# lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdb                   8:16   0 465,8G  0 disk
└─md126               9:126  0 465,8G  0 raid1
  ├─md126p3         259:2    0   314G  0 md
  │ ├─vgsystem-swap 253:1    0    64G  0 lvm
  │ ├─vgsystem-root 253:0    0   100G  0 lvm   /
  │ └─vgsystem-home 253:3    0   200G  0 lvm   /home
  ├─md126p1         259:0    0   200M  0 md    /boot/efi
  ├─md126p4         259:3    0 150,6G  0 md
  │ └─vgsystem-root 253:0    0   100G  0 lvm   /
  └─md126p2         259:1    0     1G  0 md    /boot
sdc                   8:32   1   3,3T  0 disk
└─vgtest-test       253:2    0   3,3T  0 lvm
sda                   8:0    0 465,8G  0 disk
└─md126               9:126  0 465,8G  0 raid1
  ├─md126p3         259:2    0   314G  0 md
  │ ├─vgsystem-swap 253:1    0    64G  0 lvm
  │ ├─vgsystem-root 253:0    0   100G  0 lvm   /
  │ └─vgsystem-home 253:3    0   200G  0 lvm   /home
  ├─md126p1         259:0    0   200M  0 md    /boot/efi
  ├─md126p4         259:3    0 150,6G  0 md
  │ └─vgsystem-root 253:0    0   100G  0 lvm   /
  └─md126p2         259:1    0     1G  0 md    /boot
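For anyone reproducing this layout, the backstore side would typically be created with targetcli along these lines (a sketch only; the LV name is taken from the listing above, while the WWN placeholder is hypothetical and depends on the local qla2xxx HBA port):

```shell
# Create a block backstore from the LV described above
targetcli /backstores/block create name=test dev=/dev/vgtest/test

# Create the qla2xxx target for the local HBA port and map the LUN.
# naa.XXXXXXXXXXXXXXXX is a placeholder; use the port WWN reported
# under /sys/class/fc_host on the actual machine.
targetcli /qla2xxx create naa.XXXXXXXXXXXXXXXX
targetcli /qla2xxx/naa.XXXXXXXXXXXXXXXX/luns create /backstores/block/test
```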

8<=========

--
Lukasz Engel

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


