Re: tcm_fc crash

Jun Wu <jwu@xxxxxxxxxxxx> · Fri, 25 Apr 2014 10:43:40 -0700

Hi Nicholas,

Sorry for the late reply. I have collected the information you asked for.

Kernel version:
root@poc1:~# uname -a
Linux poc1 3.11.0-18-generic #32-Ubuntu SMP Tue Feb 18 21:11:14 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

NIC:
root@poc1:~# lspci | grep 82599
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Backstores:
Here is the targetcli output of the target machine. It has 6 hard
drives exported to 2 initiators.
/> ls
o- / ..................................................................... [...]
  o- backstores .......................................................... [...]
  | o- fileio ............................................... [0 Storage Object]
  | o- iblock .............................................. [6 Storage Objects]
  | | o- diskb ............................................ [/dev/sdb activated]
  | | o- diskc ............................................ [/dev/sdc activated]
  | | o- diskd ............................................ [/dev/sdd activated]
  | | o- diske ............................................ [/dev/sde activated]
  | | o- diskf ............................................ [/dev/sdf activated]
  | | o- diskg ............................................ [/dev/sdg activated]
  | o- pscsi ................................................ [0 Storage Object]
  | o- rd_dr ................................................ [0 Storage Object]
  | o- rd_mcp ............................................... [0 Storage Object]
  o- ib_srpt ........................................................ [0 Target]
  o- iscsi .......................................................... [0 Target]
  o- loopback ....................................................... [0 Target]
  o- qla2xxx ........................................................ [0 Target]
  o- tcm_fc ......................................................... [1 Target]
    o- 20:00:00:25:90:ef:03:ec ....................................... [enabled]
      o- acls ......................................................... [2 ACLs]
      | o- 20:00:00:25:90:ef:06:1e ............................. [6 Mapped LUNs]
      | | o- mapped_lun0 ........................................... [lun0 (rw)]
      | | o- mapped_lun1 ........................................... [lun1 (rw)]
      | | o- mapped_lun2 ........................................... [lun2 (rw)]
      | | o- mapped_lun3 ........................................... [lun3 (rw)]
      | | o- mapped_lun4 ........................................... [lun4 (rw)]
      | | o- mapped_lun5 ........................................... [lun5 (rw)]
      | o- 20:00:00:25:90:ef:06:2a ............................. [6 Mapped LUNs]
      |   o- mapped_lun0 ........................................... [lun0 (rw)]
      |   o- mapped_lun1 ........................................... [lun1 (rw)]
      |   o- mapped_lun2 ........................................... [lun2 (rw)]
      |   o- mapped_lun3 ........................................... [lun3 (rw)]
      |   o- mapped_lun4 ........................................... [lun4 (rw)]
      |   o- mapped_lun5 ........................................... [lun5 (rw)]
      o- luns ......................................................... [6 LUNs]
        o- lun0 ...................................... [iblock/diskc (/dev/sdc)]
        o- lun1 ...................................... [iblock/diskd (/dev/sdd)]
        o- lun2 ...................................... [iblock/diske (/dev/sde)]
        o- lun3 ...................................... [iblock/diskf (/dev/sdf)]
        o- lun4 ...................................... [iblock/diskg (/dev/sdg)]
        o- lun5 ...................................... [iblock/diskb (/dev/sdb)]

After compiling tcm_fc ourselves, we found that the RIP
(ft_queue_data_in+1386) points to tfc_io.c:94.
 91         /*
 92          * Setup to use first mem list entry, unless no data.
 93          */
 94         BUG_ON(remaining && !se_cmd->t_data_sg);
 95         if (remaining) {
 96                 sg = se_cmd->t_data_sg;
 97                 mem_len = sg->length;
 98                 mem_off = sg->offset;
 99                 page = sg_page(sg);
100         }

That is, the crash is BUG_ON(remaining && !se_cmd->t_data_sg) firing:
there is still data to send, but se_cmd->t_data_sg is NULL.
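
To make the failing condition concrete, here is a minimal userspace
sketch of the same predicate. It is only a model, not kernel code:
struct model_sg, struct model_cmd, would_hit_bug_on() and
model_self_check() are hypothetical names made up for illustration, and
it assumes that "remaining" starts out as the command's data length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the kernel's scatterlist or se_cmd. */
struct model_sg {
    unsigned int length;
    unsigned int offset;
};

struct model_cmd {
    size_t data_length;         /* bytes still to be sent (assumption) */
    struct model_sg *t_data_sg; /* NULL when no data buffer is attached */
};

/* Same predicate as tfc_io.c:94: data remains, but there is no SG list. */
static bool would_hit_bug_on(const struct model_cmd *cmd)
{
    size_t remaining = cmd->data_length;

    return remaining && !cmd->t_data_sg;
}

/* Walks the three interesting cases; only the last one would trip. */
static void model_self_check(void)
{
    struct model_sg sg = { .length = 4096, .offset = 0 };
    struct model_cmd ok = { .data_length = 4096, .t_data_sg = &sg };
    struct model_cmd empty = { .data_length = 0, .t_data_sg = NULL };
    struct model_cmd bad = { .data_length = 4096, .t_data_sg = NULL };

    assert(!would_hit_bug_on(&ok));    /* data with an SG list: fine */
    assert(!would_hit_bug_on(&empty)); /* no data, no SG list: fine */
    assert(would_hit_bug_on(&bad));    /* data without an SG list: BUG */
}
```

Calling model_self_check() from a small main() exercises all three
cases. Only the last case, a nonzero length with a NULL list, matches
what the target seems to hit here.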

root@poc1:~# modinfo tcm_fc
filename:       /lib/modules/3.11.0-18-generic/kernel/drivers/target/tcm_fc/tcm_fc.ko
license:        GPL
description:    FC TCM fabric driver 0.4
srcversion:     68B468A9E0DB43CC9653984
depends:        target_core_mod,libfc
vermagic:       3.11.0-18-generic SMP mod_unload modversions
parm:           debug_logging:a bit mask of logging levels (int)

To reproduce: on both initiators, run fio against all 6 hard drives on
the target at the same time. The target crashes within a few seconds
every time, at the same RIP.

Thanks,

Jun

On Thu, Apr 17, 2014 at 5:03 PM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> Hi Jun,
>
> On Tue, 2014-04-15 at 09:15 -0700, Jun Wu wrote:
>> Hello,
>>
>> We are working on a cluster file system using FCoE VN2VN. Multiple
>> initiators can see the same set of target hard drives exported by
>> targetcli tcm_fc. When the initiators run I/O to these target hard
>> drives at the same time, the target system crashes, whether we use the
>> iblock backstore or the pscsi backstore. See the following dump.
>>
>> crash> bt
>> PID: 318    TASK: ffff880c1a05aee0  CPU: 5   COMMAND: "kworker/5:1"
>>  #0 [ffff880c1a895a48] machine_kexec at ffffffff810485e2
>>  #1 [ffff880c1a895a98] crash_kexec at ffffffff810d09d3
>>  #2 [ffff880c1a895b60] oops_end at ffffffff816f0c98
>>  #3 [ffff880c1a895b88] die at ffffffff8101616b
>>  #4 [ffff880c1a895bb8] do_trap at ffffffff816f04b0
>>  #5 [ffff880c1a895c08] do_invalid_op at ffffffff810134a8
>>  #6 [ffff880c1a895cb0] invalid_op at ffffffff816f9c1e
>>     [exception RIP: ft_queue_data_in+1386]
>>     RIP: ffffffffa0641eda  RSP: ffff880c1a895d68  RFLAGS: 00010246
>>     RAX: 0000000000001000  RBX: ffff880c17a6dc10  RCX: 0000000000000002
>>     RDX: 0000000000000000  RSI: ffff880c1afa36d8  RDI: 0000000000000000
>>     RBP: ffff880c1a895df8   R8: ffff880c1667e45c   R9: dfcf2970a166dd90
>>     R10: dfcf2970a166dd90  R11: 0000000000000000  R12: ffff880c17a6dc10
>>     R13: ffff880c3fc33e00  R14: 0000000000001000  R15: 0000000000000140
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>  #7 [ffff880c1a895d60] ft_queue_data_in at ffffffffa06419c7 [tcm_fc]
>>  #8 [ffff880c1a895e00] target_complete_ok_work at ffffffffa04ded21 [target_core_mod]
>>  #9 [ffff880c1a895e28] process_one_work at ffffffff8107d0ec
>> #10 [ffff880c1a895e70] worker_thread at ffffffff8107dd3c
>> #11 [ffff880c1a895ed0] kthread at ffffffff810848d0
>> #12 [ffff880c1a895f50] ret_from_fork at ffffffff816f836c
>>
>> Is there any way to avoid this problem?
>
> Can you be a bit more specific on the setup..?  Eg: kernel version on
> the target, NICs, backstores, etc.
>
> Also, it might be useful if you can run the RIP (ft_queue_data_in+1386)
> through gdb with your kernel source to see where the bug is actually
> pointing.
>
> (Also, CC'ing some of the Intel FCoE folks)
>
> --nab
>