Hello
I am experiencing problems using LIO FC target with vmware ESX (6.5).
This is the same problem as discussed in
https://www.spinics.net/lists/target-devel/msg15670.html
I've tried 4.11.x (up to 4.11.12) and 4.12.3 (and also some kernels
from the 4.9 line).
My setup is 3 ESX 6.5 hosts using FC target storage with 2x QLogic 2460
(4 Gbit) HBAs (I also have another storage box with a QLE-2562 (8 Gbit) -
the same problem occurs there).
I am able to reproduce the problem almost 100% of the time by generating
a large I/O load - concurrently resuming many VMs (sketch below) - the
problem always occurs within a minute.
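Roughly what I run to generate that load (just a sketch - run on each
ESX host; the VM IDs are whatever vim-cmd enumerates there, the exact
set of VMs is of course environment-specific):

    # power on / resume all registered VMs in parallel to create the
    # concurrent I/O burst described above
    for id in $(vim-cmd vmsvc/getallvms | awk 'NR > 1 {print $1}'); do
        vim-cmd vmsvc/power.on "$id" &    # also resumes suspended VMs
    done
    wait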
* On clean 4.12.3: the BUG occurs almost always on the first ABORT_TASK
issued by VMware:
[...]
Jul 27 17:06:40 teststorage [ 31.336673] qla2xxx
[0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success
ffff881f80d6dc00
Jul 27 17:06:40 teststorage [ 708.583638] Detected MISCOMPARE for addr:
ffff881f8fb15000 buf: ffff881f7a46aa00
Jul 27 17:06:40 teststorage [ 708.583651] Target/iblock: Send
MISCOMPARE check condition and sense
Jul 27 17:06:42 teststorage [ 710.627503] ABORT_TASK: Found referenced
qla2xxx task_tag: 1168472
Jul 27 17:06:43 teststorage [ 711.454055] ------------[ cut here
]------------
Jul 27 17:06:43 teststorage [ 711.454062] kernel BUG at
drivers/scsi/qla2xxx/qla_target.c:3643!
Jul 27 17:06:43 teststorage [ 711.454067] invalid opcode: 0000 [#1] SMP
Jul 27 17:06:43 teststorage [ 711.454070] Modules linked in:
tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file
target_core_iblock iscsi_target_mod target_core_mod netconsole lm90 vfat
fat intel_rapl x86_pkg_temp_thermal coretemp crct10dif_pclmul ses
crc32_pclmul enclosure ghash_clmulni_intel scsi_transport_sas iTCO_wdt
iTCO_vendor_support lpc_ich i2c_i801 mei_me mei tpm_tis tpm_tis_core tpm
nfsd auth_rpcgss nfs_acl lockd grace dm_multipath sunrpc binfmt_misc
nouveau video drm_kms_helper ttm raid1 drm qla2xxx igb e1000e mxm_wmi
aacraid dca ptp crc32c_intel pps_core scsi_transport_fc i2c_algo_bit wmi
Jul 27 17:06:43 teststorage [ 711.454101] CPU: 3 PID: 209 Comm:
kworker/u24:7 Not tainted 4.12.3 #1
Jul 27 17:06:43 teststorage [ 711.454104] Hardware name: ASUS All
Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Jul 27 17:06:43 teststorage [ 711.454118] Workqueue: tmr-iblock
target_tmr_work [target_core_mod]
Jul 27 17:06:43 teststorage [ 711.454122] task: ffff881f82173b00
task.stack: ffffc9000d770000
Jul 27 17:06:43 teststorage [ 711.454132] RIP:
0010:qlt_free_cmd+0x138/0x150 [qla2xxx]
Jul 27 17:06:43 teststorage [ 711.454135] RSP: 0018:ffffc9000d773d60
EFLAGS: 00010286
Jul 27 17:06:43 teststorage [ 711.454138] RAX: 0000000000000088 RBX:
ffff881f8ebb8bf8 RCX: ffffffffa019fade
Jul 27 17:06:43 teststorage [ 711.454142] RDX: 000000000000e074 RSI:
ffff881f76fc87c0 RDI: 0000000000004000
Jul 27 17:06:43 teststorage [ 711.454145] RBP: ffffc9000d773d80 R08:
ffffffffa018c07c R09: ffff881f8ebb8bf8
Jul 27 17:06:43 teststorage [ 711.454148] R10: ffffea007e076ac0 R11:
ffff881f81dab640 R12: ffff881f80d6d800
Jul 27 17:06:43 teststorage [ 711.454151] R13: ffff881f8ebb8c80 R14:
0000000000000286 R15: ffff881f8f5e2000
Jul 27 17:06:43 teststorage [ 711.454155] FS: 0000000000000000(0000)
GS:ffff881f9f2c0000(0000) knlGS:0000000000000000
Jul 27 17:06:43 teststorage [ 711.454158] CS: 0010 DS: 0000 ES: 0000
CR0: 0000000080050033
Jul 27 17:06:43 teststorage [ 711.454161] CR2: 00007fff00782a68 CR3:
0000001f81e33000 CR4: 00000000003406e0
Jul 27 17:06:43 teststorage [ 711.454164] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Jul 27 17:06:43 teststorage [ 711.454167] DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Jul 27 17:06:43 teststorage [ 711.454170] Call Trace:
Jul 27 17:06:43 teststorage [ 711.454176]
tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx]
Jul 27 17:06:43 teststorage [ 711.454183]
target_put_sess_cmd+0xce/0x140 [target_core_mod]
Jul 27 17:06:43 teststorage [ 711.454190]
core_tmr_abort_task+0x127/0x190 [target_core_mod]
Jul 27 17:06:43 teststorage [ 711.454197] target_tmr_work+0x111/0x120
[target_core_mod]
Jul 27 17:06:43 teststorage [ 711.454203] process_one_work+0x144/0x370
Jul 27 17:06:43 teststorage [ 711.454206] worker_thread+0x4d/0x3c0
Jul 27 17:06:43 teststorage [ 711.454210] kthread+0x109/0x140
Jul 27 17:06:43 teststorage [ 711.454487] ? rescuer_thread+0x360/0x360
Jul 27 17:06:43 teststorage [ 711.455029] ? kthread_park+0x60/0x60
Jul 27 17:06:43 teststorage [ 711.455579] ret_from_fork+0x25/0x30
Jul 27 17:06:43 teststorage [ 711.456122] Code: 7f a6 73 e1 0f b6 83 8c
02 00 00 e9 54 ff ff ff 48 8b bb c0 02 00 00 48 89 de e8 04 fd ff ff 0f
b6 83 8c 02 00 00 e9 35 ff ff ff <0f> 0b 48 8b bb 90 02 00 00 e8 8a e2
0a e1 e9 44 ff ff ff 0f 0b
Jul 27 17:06:43 teststorage [ 711.457279] RIP: qlt_free_cmd+0x138/0x150
[qla2xxx] RSP: ffffc9000d773d60
Jul 27 17:06:43 teststorage [ 711.457857] ---[ end trace
99e519b9a5b6591b ]---
8<-----------
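(In case it's useful: the RIP offset maps back to the source line
(qla_target.c:3643, per the oops) with the kernel's scripts/faddr2line
helper - a sketch, it needs the objects from the exact 4.12.3 build
that is running, built with debug info, and my paths will differ:

    cd /usr/src/linux-4.12.3    # tree the running kernel was built from
    ./scripts/faddr2line drivers/scsi/qla2xxx/qla2xxx.ko 'qlt_free_cmd+0x138/0x150'
)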
I have also tried the patch from
https://www.spinics.net/lists/target-devel/msg15759.html
(commit 4db6a8145940d0bbd10265020d681961ce2d3238 - currently not
available in git(?)).
The result: no BUGs, and the system survives a little longer, but within
minutes a kworker hangs (trace below; a note on gathering more state
follows it):
Jul 26 15:21:33 teststorage [ 3588.629920] ABORT_TASK: Found referenced
qla2xxx task_tag: 1202220
Jul 26 15:21:34 teststorage [ 3589.491615] ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1202220
Jul 26 15:21:34 teststorage [ 3589.491700] ABORT_TASK: Found referenced
qla2xxx task_tag: 1202264
Jul 26 15:25:16 teststorage [ 3810.752583] INFO: task kworker/u24:2:2975
blocked for more than 120 seconds.
Jul 26 15:25:16 teststorage [ 3810.752672] Not tainted 4.12.3 #1
Jul 26 15:25:16 teststorage [ 3810.752749] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 15:25:16 teststorage [ 3810.752831] kworker/u24:2 D 0
2975 2 0x00000080
Jul 26 15:25:16 teststorage [ 3810.752920] Workqueue: tmr-iblock
target_tmr_work [target_core_mod]
Jul 26 15:25:16 teststorage [ 3810.753006] Call Trace:
Jul 26 15:25:16 teststorage [ 3810.753207] __schedule+0x285/0x840
Jul 26 15:25:16 teststorage [ 3810.753410] schedule+0x36/0x80
Jul 26 15:25:16 teststorage [ 3810.753623] schedule_timeout+0x242/0x2f0
Jul 26 15:25:16 teststorage [ 3810.753832] ? radix_tree_lookup+0xd/0x10
Jul 26 15:25:16 teststorage [ 3810.754046] ? get_work_pool+0x2d/0x40
Jul 26 15:25:16 teststorage [ 3810.754263] ? flush_work+0x14d/0x190
Jul 26 15:25:16 teststorage [ 3810.754486] wait_for_completion+0x111/0x170
Jul 26 15:25:16 teststorage [ 3810.754717] ? wake_up_q+0x80/0x80
Jul 26 15:25:16 teststorage [ 3810.754949]
__transport_wait_for_tasks+0xa7/0x140 [target_core_mod]
Jul 26 15:25:16 teststorage [ 3810.755189]
transport_wait_for_tasks+0x53/0x90 [target_core_mod]
Jul 26 15:25:16 teststorage [ 3810.755454]
core_tmr_abort_task+0x10e/0x190 [target_core_mod]
Jul 26 15:25:16 teststorage [ 3810.755717] target_tmr_work+0x111/0x120
[target_core_mod]
Jul 26 15:25:16 teststorage [ 3810.755972] process_one_work+0x144/0x370
Jul 26 15:25:16 teststorage [ 3810.756236] worker_thread+0x4d/0x3c0
Jul 26 15:25:16 teststorage [ 3810.756507] kthread+0x109/0x140
Jul 26 15:25:16 teststorage [ 3810.756785] ? rescuer_thread+0x360/0x360
Jul 26 15:25:16 teststorage [ 3810.757060] ? kthread_park+0x60/0x60
Jul 26 15:25:16 teststorage [ 3810.757348] ? do_syscall_64+0x67/0x150
Jul 26 15:25:16 teststorage [ 3810.757636] ret_from_fork+0x25/0x30
Jul 26 15:25:16 teststorage [ 3810.757927] NMI backtrace for cpu 3
Jul 26 15:25:16 teststorage [ 3810.758220] CPU: 3 PID: 107 Comm:
khungtaskd Not tainted 4.12.3 #1
Jul 26 15:25:16 teststorage [ 3810.758530] Hardware name: ASUS All
Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
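When the hang happens I can also grab the state of all blocked tasks
from the kernel log, e.g.:

    echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
    echo w > /proc/sysrq-trigger       # dump blocked (D-state) tasks
    echo t > /proc/sysrq-trigger       # or the full task list

so if any of that output would help, just say so.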
(I have also tried 4.13-rc2, but there I currently hit a different
problem initializing qla2xxx ("qla2xxx 0000:09:00.0: can't allocate
MSI-X affinity masks for 2 vectors"), so I can't use the FC target...)
I am willing to provide more information if needed - this is currently
my out-of-production storage, so if anyone can look at this problem I
can experiment on my side, play with patches, retest, etc.
Any help appreciated :)
regards,
Lukasz Engel