Re: tcm_fc crash

Jun Wu <jwu@xxxxxxxxxxxx> · Mon, 28 Apr 2014 14:36:17 -0700

Hi Nicholas,

We upgraded from Ubuntu 13.10 to the latest 14.04, which has a
3.13.0-24-generic kernel. I reproduced the bug with v3.13. v3.14
doesn't compile against this kernel, but tfc_io.c is the same in v3.14.

The initiator I am using is open-fcoe-3.11.

root@poc2:~# modinfo fcoe
filename:       /lib/modules/3.13.0-24-generic/kernel/drivers/scsi/fcoe/fcoe.ko
license:        GPL v2
description:    FCoE
author:         Open-FCoE.org
srcversion:     6DA44562FC66B71637941E8
depends:        libfcoe,libfc,scsi_transport_fc
intree:         Y
vermagic:       3.13.0-24-generic SMP mod_unload modversions
signer:         Magrathea: Glacier signing key
sig_key:        00:A5:A6:57:59:DE:47:4B:C5:C4:31:20:88:0C:1B:94:A5:39:F4:31
sig_hashalgo:   sha512
parm:           ddp_min:Minimum I/O size in bytes for Direct Data
Placement (DDP). (uint)
parm:           debug_logging:a bit mask of logging levels (int)

root@poc2:~# fcoeadm -v
1.0.29

By issuing "echo eth2 > /sys/module/libfcoe/parameters/create_vn2vn",
the initiator can see the target drives.

I always run fio with 4KB sequential reads on all the drives to
reproduce the bug. Here is the crash dump, including the output of your
debug patch:

[ 2883.272203] TARGET_CORE[fc]: Unsupported SCSI Opcode 0x85, sending
CHECK_CONDITION.
[ 3473.092860] CDB: 0x28 data_length: 4096 t_data_sg:           (null)
t_data_nents: 0se_cmd_flags: 0x00000109
[ 3473.092887] ------------[ cut here ]------------
[ 3473.092958] kernel BUG at /home/zb/target3.13/target/tcm_fc/tfc_io.c:100!
[ 3473.093056] invalid opcode: 0000 [#1] SMP
[ 3473.093123] Modules linked in: ib_srpt tcm_qla2xxx qla2xxx
tcm_loop(OF) tcm_fc(OF) iscsi_target_mod(OF) target_core_pscsi(OF)
target_core_file(OF) target_core_iblock(OF) target_core_mod(OF)
configfs 8021q garp stp mrp llc fcoe libfcoe libfc scsi_transport_fc
scsi_tgt ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi radeon ttm
drm_kms_helper drm gpio_ich intel_powerclamp coretemp kvm_intel kvm
ioatdma psmouse serio_raw lpc_ich i7core_edac edac_core shpchp mac_hid
lp parport hid_generic ses enclosure pata_acpi igb ixgbe usbhid
i2c_algo_bit hid dca pata_jmicron ptp mdio aacraid pps_core
[ 3473.094223] CPU: 9 PID: 183 Comm: kworker/9:1 Tainted: GF
O 3.13.0-24-generic #46-Ubuntu
[ 3473.094347] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c       10/28/2011
[ 3473.094462] Workqueue: target_completion target_complete_ok_work
[target_core_mod]
[ 3473.094573] task: ffff88061b82dfc0 ti: ffff88061b966000 task.ti:
ffff88061b966000
[ 3473.094678] RIP: 0010:[<ffffffffa056905b>]  [<ffffffffa056905b>]
ft_queue_data_in+0x57b/0x580 [tcm_fc]
[ 3473.094817] RSP: 0018:ffff88061b967d78  EFLAGS: 00010286
[ 3473.094892] RAX: 000000000000005f RBX: ffff88060a5b6790 RCX: 0000000000000000
[ 3473.094990] RDX: ffff880627caffe0 RSI: ffff880627cae3c8 RDI: 0000000000000246
[ 3473.095090] RBP: ffff88061b967df8 R08: 0000000000000092 R09: 00000000000006ef
[ 3473.095189] R10: 0000000000000000 R11: ffff88061b967aa6 R12: ffff88060a5b6790
[ 3473.095288] R13: ffff88060a5b68d8 R14: 0000000000001000 R15: 0000000000000240
[ 3473.095388] FS:  0000000000000000(0000) GS:ffff880627ca0000(0000)
knlGS:0000000000000000
[ 3473.095502] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3473.095581] CR2: 00007f8588b45000 CR3: 0000000001c0e000 CR4: 00000000000007e0
[ 3473.095680] Stack:
[ 3473.095710]  ffff88061b82dfc0 0000000000000000 ffff88060a5b68b0
ffff88061bae9bdc
[ 3473.095831]  0000000000000000 0000000000000240 ffff88061b967db8
ffff8806111a76e8
[ 3473.095951]  ffff88061b967de0 ffff88060a5b6790 ffff88061b967df8
ffff88060a5b68d8
[ 3473.096070] Call Trace:
[ 3473.096114]  [<ffffffffa052112c>]
target_complete_ok_work+0x16c/0x2d0 [target_core_mod]
[ 3473.096230]  [<ffffffff810838a2>] process_one_work+0x182/0x450
[ 3473.096315]  [<ffffffff81084641>] worker_thread+0x121/0x410
[ 3473.096393]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
[ 3473.096478]  [<ffffffff8108b312>] kthread+0xd2/0xf0
[ 3473.096546]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
[ 3473.096641]  [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
[ 3473.096718]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
[ 3473.096810] Code: 0f 0b 48 8b 5d c8 31 c9 48 c7 c7 40 a7 56 a0 48
8b 83 e8 00 00 00 44 8b 4b 20 44 8b 83 78 01 00 00 0f b6 30 31 c0 e8
c4 66 1a e1 <0f> 0b 0f 1f 00 66 66 66 66 90 55 48 85 ff 48 89 e5 53 48
89 fb
[ 3473.097355] RIP  [<ffffffffa056905b>] ft_queue_data_in+0x57b/0x580 [tcm_fc]
[ 3473.097457]  RSP <ffff88061b967d78>
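
For reference, the two opcodes in the log above can be decoded with a
small lookup. This is only a minimal sketch: scsi_opcode_name() is a
hypothetical helper, not a kernel function, and the table covers just a
handful of standard SCSI operation codes. It shows that 0x28 is
READ(10), which fits the 4KB fio reads, and 0x85 is ATA PASS-THROUGH(16).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the kernel): map a few well-known
 * SCSI operation codes to their names. Anything not listed is reported
 * as "unknown".
 */
const char *scsi_opcode_name(unsigned char opcode)
{
	switch (opcode) {
	case 0x00: return "TEST UNIT READY";
	case 0x12: return "INQUIRY";
	case 0x28: return "READ(10)";
	case 0x2a: return "WRITE(10)";
	case 0x85: return "ATA PASS-THROUGH(16)";
	case 0x88: return "READ(16)";
	default:   return "unknown";
	}
}

/* Return nonzero if the first CDB byte is the READ(10) opcode. */
int cdb_is_read10(const unsigned char *cdb)
{
	assert(cdb != NULL);
	return cdb[0] == 0x28;
}
```

Passing 0x28 (the CDB byte printed by the debug patch) returns
"READ(10)", and 0x85 returns "ATA PASS-THROUGH(16)".
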
crash>
crash> bt
PID: 183    TASK: ffff88061b82dfc0  CPU: 9   COMMAND: "kworker/9:1"
 #0 [ffff88061b967a58] machine_kexec at ffffffff8104a732
 #1 [ffff88061b967aa8] crash_kexec at ffffffff810e6ab3
 #2 [ffff88061b967b70] oops_end at ffffffff8171ef68
 #3 [ffff88061b967b98] die at ffffffff810171cb
 #4 [ffff88061b967bc8] do_trap at ffffffff8171e660
 #5 [ffff88061b967c18] do_invalid_op at ffffffff81014512
 #6 [ffff88061b967cc0] invalid_op at ffffffff81727c5e
    [exception RIP: ft_queue_data_in+1403]
    RIP: ffffffffa056905b  RSP: ffff88061b967d78  RFLAGS: 00010286
    RAX: 000000000000005f  RBX: ffff88060a5b6790  RCX: 0000000000000000
    RDX: ffff880627caffe0  RSI: ffff880627cae3c8  RDI: 0000000000000246
    RBP: ffff88061b967df8   R8: 0000000000000092   R9: 00000000000006ef
    R10: 0000000000000000  R11: ffff88061b967aa6  R12: ffff88060a5b6790
    R13: ffff88060a5b68d8  R14: 0000000000001000  R15: 0000000000000240
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88061b967d70] ft_queue_data_in at ffffffffa056905b [tcm_fc]
 #8 [ffff88061b967e00] target_complete_ok_work at ffffffffa052112c [target_core_
 #9 [ffff88061b967e28] process_one_work at ffffffff810838a2
#10 [ffff88061b967e70] worker_thread at ffffffff81084641
#11 [ffff88061b967ed0] kthread at ffffffff8108b312
#12 [ffff88061b967f50] ret_from_fork at ffffffff8172637c
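
The oops prints the RIP as ft_queue_data_in+0x57b/0x580 while crash
prints ft_queue_data_in+1403; these are the same offset in hex and
decimal. A minimal sketch for checking that, assuming the usual
"symbol+0xOFF/0xSIZE" form (parse_oops_symbol() is a hypothetical
helper, not part of the kernel or the crash utility):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: split "symbol+0xOFF/0xSIZE" into its parts.
 * Returns 0 on success, -1 if the string does not match that form or
 * the symbol name does not fit into the caller's buffer.
 */
int parse_oops_symbol(const char *s, char *name, size_t name_len,
		      unsigned long *offset, unsigned long *size)
{
	const char *plus;
	size_t len;

	assert(s != NULL && name != NULL && offset != NULL && size != NULL);

	plus = strchr(s, '+');
	if (!plus)
		return -1;

	len = (size_t)(plus - s);
	if (len >= name_len)
		return -1;

	memcpy(name, s, len);
	name[len] = '\0';

	/* %lx accepts an optional 0x prefix, as strtoul() does in base 16 */
	if (sscanf(plus + 1, "%lx/%lx", offset, size) != 2)
		return -1;

	return 0;
}
```

For "ft_queue_data_in+0x57b/0x580" this gives offset 1403 and function
size 1408, so the BUG() sits 5 bytes before the end of the function.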

Thanks,

Jun

On Fri, Apr 25, 2014 at 3:29 PM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> On Fri, 2014-04-25 at 10:43 -0700, Jun Wu wrote:
>> Hi Nicholas,
>>
>> Sorry for the late reply. I have collected the information you asked for.
>>
>> Kernel version:
>> root@poc1:~# uname -a
>>  Linux poc1 3.11.0-18-generic #32-Ubuntu SMP Tue Feb 18 21:11:14 UTC
>> 2014 x86_64 x86_64 x86_64 GNU/Linux
>>
>> NIC:
>> root@poc1:~# lspci | grep 82599
>>  08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
>> SFI/SFP+ Network Connection (rev 01)
>>
>
> Thanks for the additional info.  Please also provide the specifics of
> the FCoE initiator setup.
>
>> Backstores:
>> Here is the targetcli output of the target machine. It has 6 hard
>> drives exported to 2 initiators.
>> /> ls
>> o- / ..................................................................... [...]
>>   o- backstores .......................................................... [...]
>>   | o- fileio ............................................... [0 Storage Object]
>>   | o- iblock .............................................. [6 Storage Objects]
>>   | | o- diskb ............................................ [/dev/sdb activated]
>>   | | o- diskc ............................................ [/dev/sdc activated]
>>   | | o- diskd ............................................ [/dev/sdd activated]
>>   | | o- diske ............................................ [/dev/sde activated]
>>   | | o- diskf ............................................ [/dev/sdf activated]
>>   | | o- diskg ............................................ [/dev/sdg activated]
>>   | o- pscsi ................................................ [0 Storage Object]
>>   | o- rd_dr ................................................ [0 Storage Object]
>>   | o- rd_mcp ............................................... [0 Storage Object]
>>   o- ib_srpt ........................................................ [0 Target]
>>   o- iscsi .......................................................... [0 Target]
>>   o- loopback ....................................................... [0 Target]
>>   o- qla2xxx ........................................................ [0 Target]
>>   o- tcm_fc ......................................................... [1 Target]
>>     o- 20:00:00:25:90:ef:03:ec ....................................... [enabled]
>>       o- acls ......................................................... [2 ACLs]
>>       | o- 20:00:00:25:90:ef:06:1e ............................. [6 Mapped LUNs]
>>       | | o- mapped_lun0 ........................................... [lun0 (rw)]
>>       | | o- mapped_lun1 ........................................... [lun1 (rw)]
>>       | | o- mapped_lun2 ........................................... [lun2 (rw)]
>>       | | o- mapped_lun3 ........................................... [lun3 (rw)]
>>       | | o- mapped_lun4 ........................................... [lun4 (rw)]
>>       | | o- mapped_lun5 ........................................... [lun5 (rw)]
>>       | o- 20:00:00:25:90:ef:06:2a ............................. [6 Mapped LUNs]
>>       |   o- mapped_lun0 ........................................... [lun0 (rw)]
>>       |   o- mapped_lun1 ........................................... [lun1 (rw)]
>>       |   o- mapped_lun2 ........................................... [lun2 (rw)]
>>       |   o- mapped_lun3 ........................................... [lun3 (rw)]
>>       |   o- mapped_lun4 ........................................... [lun4 (rw)]
>>       |   o- mapped_lun5 ........................................... [lun5 (rw)]
>>       o- luns ......................................................... [6 LUNs]
>>         o- lun0 ...................................... [iblock/diskc (/dev/sdc)]
>>         o- lun1 ...................................... [iblock/diskd (/dev/sdd)]
>>         o- lun2 ...................................... [iblock/diske (/dev/sde)]
>>         o- lun3 ...................................... [iblock/diskf (/dev/sdf)]
>>         o- lun4 ...................................... [iblock/diskg (/dev/sdg)]
>>         o- lun5 ...................................... [iblock/diskb (/dev/sdb)]
>>
>> By compiling tcm_fc, we found the RIP (ft_queue_data_in+1386) points
>> to tfc_io.c:94.
>>  91         /*
>>  92          * Setup to use first mem list entry, unless no data.
>>  93          */
>>  94         BUG_ON(remaining && !se_cmd->t_data_sg);
>>  95         if (remaining) {
>>  96                 sg = se_cmd->t_data_sg;
>>  97                 mem_len = sg->length;
>>  98                 mem_off = sg->offset;
>>  99                 page = sg_page(sg);
>> 100         }
>>
>> That is BUG_ON(remaining && !se_cmd->t_data_sg).
>>
>
> So let's find out a little more about the CDB that is triggering the
> bug.
>
> Please apply the following patch to your v3.11 tree to dump the se_cmd
> in question when the bug is triggered in ft_queue_data_in():
>
> diff --git a/drivers/target/tcm_fc/tfc_io.c b/drivers/target/tcm_fc/tfc_io.c
> index e415af3..8009407 100644
> --- a/drivers/target/tcm_fc/tfc_io.c
> +++ b/drivers/target/tcm_fc/tfc_io.c
> @@ -91,7 +91,13 @@ int ft_queue_data_in(struct se_cmd *se_cmd)
>         /*
>          * Setup to use first mem list entry, unless no data.
>          */
> -       BUG_ON(remaining && !se_cmd->t_data_sg);
> +       if (remaining && !se_cmd->t_data_sg) {
> +               printk("CDB: 0x%02x data_length: %u t_data_sg: %p t_data_nents: %u"
> +                       "se_cmd_flags: 0x%08x\n", se_cmd->t_task_cdb[0],
> +                       se_cmd->data_length, se_cmd->t_data_sg,
> +                       se_cmd->t_data_nents, se_cmd->se_cmd_flags);
> +               BUG();
> +       }
>         if (remaining) {
>                 sg = se_cmd->t_data_sg;
>                 mem_len = sg->length;
>
>
>> root@poc1:~# modinfo tcm_fc
>> filename:
>> /lib/modules/3.11.0-18-generic/kernel/drivers/target/tcm_fc/tcm_fc.ko
>> license:        GPL
>> description:    FC TCM fabric driver 0.4
>> srcversion:     68B468A9E0DB43CC9653984
>> depends:        target_core_mod,libfc
>> vermagic:       3.11.0-18-generic SMP mod_unload modversions
>> parm:           debug_logging:a bit mask of logging levels (int)
>>
>> On the 2 initiators, we run fio against all 6 hard drives on the
>> target at the same time. The target crashes within a few seconds every
>> time at the same RIP.
>>
>
> So I don't see any tcm_fc specific changes in v3.11 code that would be
> causing such a bug, nor any v3.11.y bugfixes in this area that would
> apply.
>
> Also since the bug is easy to reproduce with multiple initiators, it
> might be worthwhile to try to reproduce with v3.14.y as well.
>
> Thanks,
>
> --nab
>