Re: System crashes with increased drive count

On Tue, 2014-05-13 at 14:52 -0700, Jun Wu wrote:
> Hi Nicholas,
> 
> We had to roll back the system from 3.14 to 3.11 due to compile issues
> with our software, so I am not able to verify your fix at this point.

It's unfortunate that you're stuck on a now-unsupported stable kernel.
There are some other libfc-related fixes that went in during the
v3.13 timeframe, so I'd strongly recommend upgrading to at least that
stable version.

In any event, I'll be pushing that particular patch for >= v3.13.y anyway,
as it's an obvious regression bugfix for percpu-ida pre-allocation.

> I ran the same tests on 3.11 instead.
> 
> In one case the target crashed with the following message:
> 
> May 13 13:06:25 poc2 kernel: BUG: unable to handle kernel paging request at ffffffffffffffa4
> May 13 13:06:25 poc2 kernel: IP: [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40
> May 13 13:06:25 poc2 kernel: PGD 1c0f067 PUD 1c11067 PMD 0
> May 13 13:06:25 poc2 kernel: Oops: 0002 [#1] SMP
> May 13 13:06:25 poc2 kernel: Modules linked in: fcoe libfcoe 8021q
> garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
> target_core_file target_core_iblock iscsi_target_mod target_core_mod
> ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
> bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
> nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> iptable_security iptable_raw nfsd auth_rpcgss nfs_acl lockd sunrpc
> ixgbe mdio igb ptp pps_core serio_raw ses enclosure iTCO_wdt
> iTCO_vendor_support lpc_ich mfd_core shpchp i2c_i801 coretemp
> kvm_intel kvm crc32c_intel microcode i7core_edac ioatdma acpi_cpufreq
> edac_core dca mperf radeon i2c_algo_bit
> May 13 13:06:25 poc2 kernel: drm_kms_helper ttm drm ata_generic
> i2c_core pata_acpi pata_jmicron aacraid
> May 13 13:06:25 poc2 kernel: CPU: 0 PID: 1810 Comm: kworker/0:0 Not tainted 3.11.10-301.fc20.x86_64 #1
> May 13 13:06:25 poc2 kernel: Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c       10/28/2011
> May 13 13:06:25 poc2 kernel: Workqueue: target_completion target_complete_ok_work [target_core_mod]
> May 13 13:06:25 poc2 kernel: task: ffff88032c5096e0 ti: ffff88031bb78000 task.ti: ffff88031bb78000
> May 13 13:06:25 poc2 kernel: RIP: 0010:[<ffffffff8164ac07>] [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40
> May 13 13:06:25 poc2 kernel: RSP: 0018:ffff88031bb79cf0  EFLAGS: 00010206
> May 13 13:06:25 poc2 kernel: RAX: 0000000000000100 RBX: ffffffffffffffa4 RCX: 0000000000000000
> May 13 13:06:25 poc2 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffffffffa4
> May 13 13:06:25 poc2 kernel: RBP: ffff88031bb79cf8 R08: 00000000ffffffff R09: ffff88031a37f678
> May 13 13:06:25 poc2 kernel: R10: 0000000000000001 R11: 0000000000000044 R12: 0000000000000000
> May 13 13:06:25 poc2 kernel: R13: ffff88031a37f678 R14: ffff88062d9fd6c8 R15: ffff88032c6da05c
> May 13 13:06:25 poc2 kernel: FS:  0000000000000000(0000) GS:ffff880333c00000(0000) knlGS:0000000000000000
> May 13 13:06:25 poc2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> May 13 13:06:25 poc2 kernel: CR2: ffffffffffffffa4 CR3: 0000000001c0c000 CR4: 00000000000007f0
> May 13 13:06:25 poc2 kernel: Stack:
> May 13 13:06:25 poc2 kernel: ffffffffffffffa4 ffff88031bb79d18 ffffffffa0594d2b ffff880328d13410
> May 13 13:06:25 poc2 kernel: ffff88031a37c200 ffff88031bb79d58 ffffffffa05356f2 0000000000000018
> May 13 13:06:25 poc2 kernel: ffff88062cea8800 0000000000000000 ffffea000c8eb640 0000000000000000
> May 13 13:06:25 poc2 kernel: Call Trace:
> May 13 13:06:25 poc2 kernel: [<ffffffffa0594d2b>] fc_seq_start_next+0x1b/0x40 [libfc]
> May 13 13:06:25 poc2 kernel: [<ffffffffa05356f2>] ft_queue_status+0xf2/0x220 [tcm_fc]
> May 13 13:06:25 poc2 kernel: [<ffffffffa0536972>] ft_queue_data_in+0x72/0x5a0 [tcm_fc]
> May 13 13:06:25 poc2 kernel: [<ffffffffa04f57ba>] target_complete_ok_work+0x14a/0x2b0 [target_core_mod]
> May 13 13:06:25 poc2 kernel: [<ffffffff810810f5>] process_one_work+0x175/0x430
> May 13 13:06:25 poc2 kernel: [<ffffffff81081d1b>] worker_thread+0x11b/0x3a0
> May 13 13:06:25 poc2 kernel: [<ffffffff81081c00>] ? rescuer_thread+0x340/0x340
> May 13 13:06:25 poc2 kernel: [<ffffffff81088660>] kthread+0xc0/0xd0
> May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
> May 13 13:06:25 poc2 kernel: [<ffffffff8165332c>] ret_from_fork+0x7c/0xb0
> May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
> May 13 13:06:25 poc2 kernel: Code: 1f 44 00 00 f3 90 0f b6 07 38 d0 75 f7 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 7e 05 a2 ff b8 00 01 00 00 <f0> 66 0f c1 03 0f b6 d4 38 c2 74 0e 0f 1f 44 00 00 f3 90 0f b6
> May 13 13:06:25 poc2 kernel: RIP  [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40
> 

So before we start debugging again, please confirm that this is a
*completely* stock v3.11.10 build, and that you're not building
out-of-tree target modules again.

> 
> In another case, the initiator crashed with:
> 
> May 13 12:00:47 poc1 kernel: [ 4086.708455] WARNING: CPU: 1 PID: 1869 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
> May 13 12:00:47 poc1 kernel: [ 4086.708459] list_del corruption. next->prev should be ffff88061dab0318, but was ffff88061d257318
> May 13 12:00:47 poc1 kernel: [ 4086.708461] Modules linked in: fcoe
> libfcoe 8021q garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt
> target_core_pscsi target_core_file target_core_iblock iscsi_target_mod
> target_core_mod nf_conntrack_netbios_ns nf_conntrack_broadcast
> ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
> bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
> nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> iptable_security iptable_raw coretemp kvm_intel kvm crc32c_intel
> iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 ses igb
> enclosure lpc_ich mfd_core ixgbe ptp pps_core mdio i7core_edac ioatdma
> edac_core dca shpchp acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd
> sunrpc radeon i2c_algo_bit drm_kms_helper ttm drm ata_generic i2c_core
> pata_acpi pata_jmicron aacraid
> May 13 12:00:47 poc1 kernel: [ 4086.708556] CPU: 1 PID: 1869 Comm: fcoethread/1 Not tainted 3.11.10-301.fc20.x86_64 #1
> May 13 12:00:47 poc1 kernel: [ 4086.708558] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c       10/28/2011
> May 13 12:00:47 poc1 kernel: [ 4086.708561]  0000000000000009 ffff8806129dfb40 ffffffff816441db ffff8806129dfb88
> May 13 12:00:47 poc1 kernel: [ 4086.708569]  ffff8806129dfb78 ffffffff8106715d ffff88061dab0318 ffff88061dab0a00
> May 13 12:00:47 poc1 kernel: [ 4086.708576]  0000000000000286 ffff880c1b5e4388 0000000000000030 ffff8806129dfbd8
> May 13 12:00:47 poc1 kernel: [ 4086.708582] Call Trace:
> May 13 12:00:47 poc1 kernel: [ 4086.708592]  [<ffffffff816441db>] dump_stack+0x45/0x56
> May 13 12:00:47 poc1 kernel: [ 4086.708598]  [<ffffffff8106715d>] warn_slowpath_common+0x7d/0xa0
> May 13 12:00:47 poc1 kernel: [ 4086.708602]  [<ffffffff810671cc>] warn_slowpath_fmt+0x4c/0x50
> May 13 12:00:47 poc1 kernel: [ 4086.708608]  [<ffffffff81311dc2>] __list_del_entry+0x82/0xd0
> May 13 12:00:47 poc1 kernel: [ 4086.708613]  [<ffffffff81311e1d>] list_del+0xd/0x30
> May 13 12:00:47 poc1 kernel: [ 4086.708624]  [<ffffffffa05de23c>] fc_io_compl+0x1cc/0x710 [libfc]
> May 13 12:00:47 poc1 kernel: [ 4086.708633]  [<ffffffffa05de7df>] fc_fcp_complete_locked+0x5f/0x1a0 [libfc]
> May 13 12:00:47 poc1 kernel: [ 4086.708642]  [<ffffffffa05dfac9>] fc_fcp_resp.isra.22+0x79/0x2f0 [libfc]
> May 13 12:00:47 poc1 kernel: [ 4086.708651]  [<ffffffff810a2a33>] ? load_balance+0xe3/0x740
> May 13 12:00:47 poc1 kernel: [ 4086.708660]  [<ffffffffa05e0424>] fc_fcp_recv+0x6e4/0xef0 [libfc]
> May 13 12:00:47 poc1 kernel: [ 4086.708666]  [<ffffffff810115ce>] ? __switch_to+0x13e/0x4b0
> May 13 12:00:47 poc1 kernel: [ 4086.708673]  [<ffffffff8164aab5>] ? _raw_spin_unlock_bh+0x15/0x20
> May 13 12:00:47 poc1 kernel: [ 4086.708682]  [<ffffffffa05dfd40>] ? fc_fcp_resp.isra.22+0x2f0/0x2f0 [libfc]
> May 13 12:00:47 poc1 kernel: [ 4086.708690]  [<ffffffffa05d421b>] fc_exch_recv+0x8eb/0xd70 [libfc]
> May 13 12:00:47 poc1 kernel: [ 4086.708695]  [<ffffffffa0613299>] fcoe_percpu_receive_thread+0x299/0x540 [fcoe]
> May 13 12:00:47 poc1 kernel: [ 4086.708699]  [<ffffffffa0613000>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
> May 13 12:00:47 poc1 kernel: [ 4086.708705]  [<ffffffff81088660>] kthread+0xc0/0xd0
> May 13 12:00:47 poc1 kernel: [ 4086.708710]  [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
> May 13 12:00:47 poc1 kernel: [ 4086.708717]  [<ffffffff8165332c>] ret_from_fork+0x7c/0xb0
> May 13 12:00:47 poc1 kernel: [ 4086.708723]  [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
> May 13 12:00:47 poc1 kernel: [ 4086.708728] ---[ end trace 61dc774d1f379191 ]---
> 

No idea about the initiator-side issue. Intel folks? (Adding
openfcoe-dev to CC.)

--nab

> [root@poc1 log]# lspci | grep 82599
> 08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> 
> [root@poc1 log]# uname -a
> Linux poc1 3.11.10-301.fc20.x86_64 #1 SMP Thu Dec 5 14:01:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> 
> Thanks,
> 

