On the initiator side, I run fio and see the following messages in dmesg:

[ 3294.893951] BUG: soft lockup - CPU#2 stuck for 22s! [systemd-udevd:4665]
[ 3294.895491] Modules linked in: target_core_pscsi target_core_file target_core_iblock ipmi_devintf ipmi_si ipmi_msghandler ib_srpt tcm_qla2xxx qla2xxx tcm_loop tcm_fc iscsi_target_mod target_core_mod configfs 8021q garp stp mrp llc fcoe libfcoe libfc scsi_transport_fc scsi_tgt ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi radeon ttm drm_kms_helper drm intel_powerclamp coretemp kvm_intel kvm gpio_ich microcode psmouse serio_raw lpc_ich ioatdma i7core_edac edac_core shpchp mac_hid lp parport ext2 ses enclosure pata_acpi hid_generic igb ixgbe usbhid i2c_algo_bit dca hid pata_jmicron ptp aacraid mdio pps_core
[ 3294.895533] CPU: 2 PID: 4665 Comm: systemd-udevd Tainted: GF W 3.11.0-18-generic #32-Ubuntu
[ 3294.895534] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
[ 3294.895536] task: ffff880628592ee0 ti: ffff88062a52e000 task.ti: ffff88062a52e000
[ 3294.895538] RIP: 0010:[<ffffffff810c64ae>] [<ffffffff810c64ae>] smp_call_function_many+0x26e/0x2d0
[ 3294.895542] RSP: 0018:ffff88062a52fad8 EFLAGS: 00000202
[ 3294.895544] RAX: 0000000000000007 RBX: ffffffff81d04dc0 RCX: ffff88063fc77fd0
[ 3294.895545] RDX: 0000000000000007 RSI: 0000000000000100 RDI: 0000000000000000
[ 3294.895547] RBP: ffff88062a52fb28 R08: ffff880333c550c8 R09: 0000000000000004
[ 3294.895549] R10: ffff880333c550c8 R11: 0000000000000005 R12: ffff88062a52fb10
[ 3294.895550] R13: 0000000000000282 R14: ffff88062a52fa78 R15: ffff880333c54580
[ 3294.895553] FS: 00007f2603f63880(0000) GS:ffff880333c40000(0000) knlGS:0000000000000000
[ 3294.895555] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3294.895556] CR2: 00007f260308b212 CR3: 0000000001c0e000 CR4: 00000000000007e0
[ 3294.895558] Stack:
[ 3294.895559]  ffff880333c550e8 0000000000015080 0000000000000000 ffffffff811d8450
[ 3294.895564]  0000010000000001 ffff88062a52fb78 ffffffff811d8450 0000000000000000
[ 3294.895567]  0000000000000002 0000000000000100 ffff88062a52fb58 ffffffff810c659a
[ 3294.895571] Call Trace:
[ 3294.895575]  [<ffffffff811d8450>] ? __brelse+0x40/0x40
[ 3294.895579]  [<ffffffff811d8450>] ? __brelse+0x40/0x40
[ 3294.895582]  [<ffffffff810c659a>] on_each_cpu_mask+0x2a/0x60
[ 3294.895585]  [<ffffffff811d7690>] ? mark_buffer_async_write+0x20/0x20
[ 3294.895588]  [<ffffffff810c6684>] on_each_cpu_cond+0xb4/0xe0
[ 3294.895591]  [<ffffffff811d8450>] ? __brelse+0x40/0x40
[ 3294.895594]  [<ffffffff811d8009>] invalidate_bh_lrus+0x29/0x30
[ 3294.895597]  [<ffffffff811dec0e>] kill_bdev+0x1e/0x30
[ 3294.895600]  [<ffffffff811e0206>] __blkdev_put+0x66/0x1b0
[ 3294.895603]  [<ffffffff811e0c6e>] blkdev_put+0x4e/0x140
[ 3294.895606]  [<ffffffff811e0e15>] blkdev_close+0x25/0x30
[ 3294.895610]  [<ffffffff811a9821>] __fput+0xe1/0x230
[ 3294.895613]  [<ffffffff811a99be>] ____fput+0xe/0x10
[ 3294.895616]  [<ffffffff81081554>] task_work_run+0xc4/0xe0
[ 3294.895620]  [<ffffffff810642b7>] do_exit+0x2b7/0xa40
[ 3294.895623]  [<ffffffff811c1991>] ? touch_atime+0x71/0x140
[ 3294.895627]  [<ffffffff81141098>] ? generic_file_aio_read+0x588/0x700
[ 3294.895630]  [<ffffffff81064abf>] do_group_exit+0x3f/0xa0
[ 3294.895633]  [<ffffffff810743c0>] get_signal_to_deliver+0x1d0/0x5e0
[ 3294.895636]  [<ffffffff811df6dc>] ? blkdev_aio_read+0x4c/0x70
[ 3294.895640]  [<ffffffff81012438>] do_signal+0x48/0x960
[ 3294.895644]  [<ffffffff81012dc8>] do_notify_resume+0x78/0xa0
[ 3294.895647]  [<ffffffff816f86da>] int_signal+0x12/0x17
[ 3294.895649] Code: 3b 05 bf fa c3 00 89 c2 0f 8d 20 fe ff ff 48 98 49 8b 4d 00 48 03 0c c5 80 40 d0 81 f6 41 20 01 74 cb 0f 1f 00 f3 90 f6 41 20 01 <75> f8 eb be 0f b6 4d d0 48 8b 55 c0 89 df 48 8b 75 c8 e8 fb fa

On Tue, Apr 15, 2014 at 9:03 AM, Jun Wu <jwu@xxxxxxxxxxxx> wrote:
> Hello,
>
> We are working on a cluster file system using fcoe vn2vn.
> Multiple initiators can see the same set of target hard drives exported
> by targetcli tcm_fc. When the initiators run I/O against these target
> hard drives at the same time, the target system crashes regardless of
> whether the backstore is iblock or pscsi. See the following dump.
>
> crash> bt
> PID: 318    TASK: ffff880c1a05aee0  CPU: 5   COMMAND: "kworker/5:1"
>  #0 [ffff880c1a895a48] machine_kexec at ffffffff810485e2
>  #1 [ffff880c1a895a98] crash_kexec at ffffffff810d09d3
>  #2 [ffff880c1a895b60] oops_end at ffffffff816f0c98
>  #3 [ffff880c1a895b88] die at ffffffff8101616b
>  #4 [ffff880c1a895bb8] do_trap at ffffffff816f04b0
>  #5 [ffff880c1a895c08] do_invalid_op at ffffffff810134a8
>  #6 [ffff880c1a895cb0] invalid_op at ffffffff816f9c1e
>     [exception RIP: ft_queue_data_in+1386]
>     RIP: ffffffffa0641eda  RSP: ffff880c1a895d68  RFLAGS: 00010246
>     RAX: 0000000000001000  RBX: ffff880c17a6dc10  RCX: 0000000000000002
>     RDX: 0000000000000000  RSI: ffff880c1afa36d8  RDI: 0000000000000000
>     RBP: ffff880c1a895df8   R8: ffff880c1667e45c   R9: dfcf2970a166dd90
>     R10: dfcf2970a166dd90  R11: 0000000000000000  R12: ffff880c17a6dc10
>     R13: ffff880c3fc33e00  R14: 0000000000001000  R15: 0000000000000140
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #7 [ffff880c1a895d60] ft_queue_data_in at ffffffffa06419c7 [tcm_fc]
>  #8 [ffff880c1a895e00] target_complete_ok_work at ffffffffa04ded21 [target_core_mod]
>  #9 [ffff880c1a895e28] process_one_work at ffffffff8107d0ec
> #10 [ffff880c1a895e70] worker_thread at ffffffff8107dd3c
> #11 [ffff880c1a895ed0] kthread at ffffffff810848d0
> #12 [ffff880c1a895f50] ret_from_fork at ffffffff816f836c
>
> Is there any way to avoid this problem?
> Thanks,
>
> Jun
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
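For context, the setup and workload described in the report can be sketched as shell commands. This is a minimal, hypothetical reproduction outline, not the reporter's actual commands: the backing device /dev/sdb, the backstore name disk0, the FC port WWN, and the initiator-side device /dev/sdX are all placeholders, and the exact targetcli syntax varies between targetcli versions (the iblock backstore name matches the report).

```shell
# --- On the target (hypothetical targetcli session; syntax varies by version) ---
# Create an iblock backstore on a local disk and export it over the
# tcm_fc (FCoE) fabric, then persist the configuration.
targetcli /backstores/iblock create name=disk0 dev=/dev/sdb
targetcli /tcm_fc create 20:00:00:11:22:33:44:55       # FC port WWN (placeholder)
targetcli /tcm_fc/20:00:00:11:22:33:44:55/luns create /backstores/iblock/disk0
targetcli saveconfig

# --- On each initiator (run concurrently from several hosts) ---
# Mixed random read/write load against the imported FCoE disk,
# roughly the kind of fio job that triggers the crash above.
fio --name=concurrent-rw --filename=/dev/sdX --ioengine=libaio \
    --direct=1 --rw=randrw --bs=4k --iodepth=32 --numjobs=4 \
    --time_based --runtime=60
```

Running the fio job from two or more initiators at once against the same exported LUNs is what produces the ft_queue_data_in crash on the target and the soft lockup on the initiator shown above.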