Re: System crashes with increased drive count

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Nicholas,

We downloaded the source of our running kernel (3.13.10-200) and
applied your percpu-ida pre-allocation regression fix, then compiled
and installed the kernel. I repeated the same test three times,
running 10 fio sessions to 10 drives on the target through fcoe vn2vn.
In the first two tests, the target machine hung with the following
messages:

15231 May 19 11:49:27 poc1 kernel: [ 1073.783229] ft_queue_data_in:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 196608,
lso_max <0x10000>
15232 May 19 11:49:27 poc1 kernel: [ 1073.783238] ft_queue_data_in:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 131072,
lso_max <0x10000>
15233 May 19 11:49:27 poc1 kernel: [ 1073.783242] ft_queue_data_in:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 65536,
lso_max <0x10000>
15234 May 19 11:49:27 poc1 kernel: [ 1073.783245] ft_queue_data_in:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 0,
lso_max <0x10000>
15235 May 19 11:49:30 poc1 kernel: [ 1076.907061] ft_queue_data_in:
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 196608,
lso_max <0x10000>
15236 May 19 11:49:30 poc1 kernel: [ 1076.907068] ft_queue_data_in:
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 131072,
lso_max <0x10000>
15237 May 19 11:49:30 poc1 kernel: [ 1076.907073] ft_queue_data_in:
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 65536,
lso_max <0x10000>
15238 May 19 11:49:30 poc1 kernel: [ 1076.907077] ft_queue_data_in:
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 0,
lso_max <0x10000>
15239 May 19 11:50:01 poc1 kernel: [ 1107.918910] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 458752,
lso_max <0x10000>
15240 May 19 11:50:01 poc1 kernel: [ 1107.918918] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 393216,
lso_max <0x10000>
15241 May 19 11:50:01 poc1 kernel: [ 1107.918922] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 327680,
lso_max <0x10000>
15242 May 19 11:50:01 poc1 kernel: [ 1107.918925] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 262144,
lso_max <0x10000>
15243 May 19 11:50:01 poc1 kernel: [ 1107.918929] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 196608,
lso_max <0x10000>
15244 May 19 11:50:01 poc1 kernel: [ 1107.918932] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 131072,
lso_max <0x10000>
15245 May 19 11:50:01 poc1 kernel: [ 1107.918936] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 65536,
lso_max <0x10000>
15246 May 19 11:50:01 poc1 kernel: [ 1107.918939] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 0,
lso_max <0x10000>
15247 May 19 11:50:05 poc1 kernel: [ 1111.450900] ft_queue_data_in:
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 196608,
lso_max <0x10000>
15248 May 19 11:50:05 poc1 kernel: [ 1111.450908] ft_queue_data_in:
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 131072,
lso_max <0x10000>
15249 May 19 11:51:12 poc1 kernel: [ 1178.698434] ft_queue_data_in: 6
callbacks suppressed
15250 May 19 11:51:12 poc1 kernel: [ 1178.698440] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 458752,
lso_max <0x10000>
15251 May 19 11:51:12 poc1 kernel: [ 1178.698446] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 393216,
lso_max <0x10000>
15252 May 19 11:51:12 poc1 kernel: [ 1178.698449] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 327680,
lso_max <0x10000>
15253 May 19 11:51:12 poc1 kernel: [ 1178.698453] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 262144,
lso_max <0x10000>
15254 May 19 11:51:12 poc1 kernel: [ 1178.698456] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 196608,
lso_max <0x10000>
15255 May 19 11:51:12 poc1 kernel: [ 1178.698460] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 131072,
lso_max <0x10000>
15256 May 19 11:51:12 poc1 kernel: [ 1178.698463] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 65536,
lso_max <0x10000>
15257 May 19 11:51:12 poc1 kernel: [ 1178.698467] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 0,
lso_max <0x10000>


In the third test, after above lines, some additional messages were printed:

15258 May 19 11:52:46 poc1 kernel: [ 1272.969464] BUG: soft lockup -
CPU#0 stuck for 22s! [fcoethread/0:3372]
15259 May 19 11:52:46 poc1 kernel: [ 1272.969468] Modules linked in:
fcoe libfcoe ipt_MASQUERADE xt_CHECKSUM tcm_fc libfc scsi_transport_fc
scsi_tgt target_core_pscsi target_core_file target_core_iblock
iscsi_target_mod target_core_mod 8      021q garp mrp ip6t_rpfilter
ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc
ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6tabl
 e_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security
iptable_raw coretemp kvm_intel kvm crc32c_intel iTCO_wdt
iTCO_vendor_support gpio_ich microcode serio_      raw lpc_ich
i2c_i801 mfd_core ioatdma ses enclosure i7core_edac edac_core shpchp
acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
drm_kms_helper ttm ixgbe drm igb ata_generic mdio pata_acpi ptp
pps_core i2c_algo_bit pata_j      micron aacraid dca i2c_core
15260 May 19 11:52:46 poc1 kernel: [ 1272.969537] CPU: 0 PID: 3372
Comm: fcoethread/0 Not tainted 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
15261 May 19 11:52:46 poc1 kernel: [ 1272.969539] Hardware name:
Supermicro X8DTN/X8DTN, BIOS 2.1c       10/28/2011
15262 May 19 11:52:46 poc1 kernel: [ 1272.969541] task:
ffff88061cb29140 ti: ffff88061e14c000 task.ti: ffff88061e14c000
15263 May 19 11:52:46 poc1 kernel: [ 1272.969543] RIP:
0010:[<ffffffff8168e834>]  [<ffffffff8168e834>]
_raw_spin_lock_bh+0x34/0x40
15264 May 19 11:52:46 poc1 kernel: [ 1272.969552] RSP:
0018:ffff88061e14de10  EFLAGS: 00000282
15265 May 19 11:52:46 poc1 kernel: [ 1272.969554] RAX:
000000000000aa8f RBX: ffff880627c14580 RCX: 000000000000136f
15266 May 19 11:52:46 poc1 kernel: [ 1272.969556] RDX:
000000000000136f RSI: ffff88061cb29140 RDI: ffff880627c153fc
15267 May 19 11:52:46 poc1 kernel: [ 1272.969558] RBP:
ffff88061e14de18 R08: ffff88061e14c000 R09: 0000000000000004
15268 May 19 11:52:46 poc1 kernel: [ 1272.969560] R10:
0000000000000004 R11: 0000000000000005 R12: ffff880c09bfd800
15269 May 19 11:52:46 poc1 kernel: [ 1272.969561] R13:
ffff880627c14580 R14: 0000000000000000 R15: 0000000000000000
15270 May 19 11:52:46 poc1 kernel: [ 1272.969564] FS:
0000000000000000(0000) GS:ffff880627c00000(0000)
knlGS:0000000000000000
15271 May 19 11:52:46 poc1 kernel: [ 1272.969566] CS:  0010 DS: 0000
ES: 0000 CR0: 000000008005003b
15272 May 19 11:52:46 poc1 kernel: [ 1272.969568] CR2:
00007f3c2c851000 CR3: 0000000001c0c000 CR4: 00000000000007f0
15273 May 19 11:52:46 poc1 kernel: [ 1272.969570] Stack:
15274 May 19 11:52:46 poc1 kernel: [ 1272.969572]  ffff880627c153e0
ffff88061e14dec8 ffffffffa06590ed ffff880627c14580
15275 May 19 11:52:46 poc1 kernel: [ 1272.969577]  ffff880c1d00d400
0000000000000000 ffff880627c14580 ffff88061cb29140
15276 May 19 11:52:46 poc1 kernel: [ 1272.969581]  ffff88061cb29140
ffff880627c153fc ffff880627c153e0 ffff88060c314dce
15277 May 19 11:52:46 poc1 kernel: [ 1272.969585] Call Trace:
15278 May 19 11:52:46 poc1 kernel: [ 1272.969592]
[<ffffffffa06590ed>] fcoe_percpu_receive_thread+0xdd/0x53c [fcoe]
15279 May 19 11:52:46 poc1 kernel: [ 1272.969597]
[<ffffffffa0659010>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
15280 May 19 11:52:46 poc1 kernel: [ 1272.969603]
[<ffffffff8108f2f2>] kthread+0xd2/0xf0
15281 May 19 11:52:46 poc1 kernel: [ 1272.969607]
[<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
15282 May 19 11:52:46 poc1 kernel: [ 1272.969613]
[<ffffffff81696dbc>] ret_from_fork+0x7c/0xb0
15283 May 19 11:52:46 poc1 kernel: [ 1272.969617]
[<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
15284 May 19 11:52:46 poc1 kernel: [ 1272.969618] Code: 53 48 89 fb e8
4e 34 9e ff b8 00 00 01 00 f0 0f c1 03 89 c2 c1 ea 10 66 39 c2 75 03
5b 5d c3 0f b7 03 89 d1 66 39 d0 74 f3 f3 90 <0f> b7 03 66 39 c8 75 f6
eb e7 66 90 66 66 66 66 90       55 48 89 e5

I didn't see the previous message "unable to handle kernel NULL
pointer dereference at 0000000000000048". So it must have been fixed
by your change.

Thanks,

Jun

On Wed, May 14, 2014 at 6:57 PM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> On Wed, 2014-05-14 at 17:23 -0700, Jun Wu wrote:
>> On Wed, May 14, 2014 at 2:39 PM, Nicholas A. Bellinger
>> <nab@xxxxxxxxxxxxxxx> wrote:
>> > On Tue, 2014-05-13 at 14:52 -0700, Jun Wu wrote:
>> >> Hi Nicholas,
>> >>
>> >> We had to roll back system from 3.14 to 3.11 due to compile issues of
>> >> our software. So I am not able to verify your fix at this point.
>> >
>> > That is unfortunate your stuck on a now unsupported stable kernel.
>> > There are some other libfc related fixes that have gone in during the
>> > v3.13 timeframe, so I'd strongly recommend upgrading to at least that
>> > stable version.
>> >
>> > In any event, I'll be pushing that particular >= v3.13.y patch anyways,
>> > as it's a obvious regression bugfix for percpu-ida pre-allocation.
>> >
>> >> I ran the same tests on 3.11 instead.
>> >>
>> >> In one case the target crashed with following message:
>> >>
>> >> May 13 13:06:25 poc2 kernel: BUG: unable to handle kernel paging
>> >> request at ffffffffffffffa4
>> >> May 13 13:06:25 poc2 kernel: IP: [<ffffffff8164ac07>]
>> >> _raw_spin_lock_bh+0x17/0x40
>> >> May 13 13:06:25 poc2 kernel: PGD 1c0f067 PUD 1c11067 PMD 0
>> >> May 13 13:06:25 poc2 kernel: Oops: 0002 [#1] SMP
>> >> May 13 13:06:25 poc2 kernel: Modules linked in: fcoe libfcoe 8021q
>> > garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
>> >> target_core_file target_core_iblock iscsi_target_mod target_core_mod
>> >> ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
>> >> bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
>> >> nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
>> >> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
>> >> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
>> >> iptable_security iptable_raw nfsd auth_rpcgss nfs_acl lockd sunrpc
>> >> ixgbe mdio igb ptp pps_core serio_raw ses enclosure iTCO_wdt
>> >> iTCO_vendor_support lpc_ich mfd_core shpchp i2c_i801 coretemp
>> >> kvm_intel kvm crc32c_intel microcode i7core_edac ioatdma acpi_cpufreq
>> >> edac_core dca mperf radeon i2c_algo_bit
>> >> May 13 13:06:25 poc2 kernel: drm_kms_helper ttm drm ata_generic
>> >> i2c_core pata_acpi pata_jmicron aacraid
>> >> May 13 13:06:25 poc2 kernel: CPU: 0 PID: 1810 Comm: kworker/0:0 Not
>> >> tainted 3.11.10-301.fc20.x86_64 #1
>> >> May 13 13:06:25 poc2 kernel: Hardware name: Supermicro X8DTN/X8DTN,
>> >> BIOS 2.1c       10/28/2011
>> >> May 13 13:06:25 poc2 kernel: Workqueue: target_completion target_complete_ok_work [target_core_mod]
>> >> May 13 13:06:25 poc2 kernel: task: ffff88032c5096e0 ti: ffff88031bb78000 task.ti: ffff88031bb78000
>> >> May 13 13:06:25 poc2 kernel: RIP: 0010:[<ffffffff8164ac07>] [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40
>> >> May 13 13:06:25 poc2 kernel: RSP: 0018:ffff88031bb79cf0  EFLAGS: 00010206
>> >> May 13 13:06:25 poc2 kernel: RAX: 0000000000000100 RBX: ffffffffffffffa4 RCX: 0000000000000000
>> >> May 13 13:06:25 poc2 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffffffffa4
>> >> May 13 13:06:25 poc2 kernel: RBP: ffff88031bb79cf8 R08: 00000000ffffffff R09: ffff88031a37f678
>> >> May 13 13:06:25 poc2 kernel: R10: 0000000000000001 R11: 0000000000000044 R12: 0000000000000000
>> >> May 13 13:06:25 poc2 kernel: R13: ffff88031a37f678 R14: ffff88062d9fd6c8 R15: ffff88032c6da05c
>> >> May 13 13:06:25 poc2 kernel: FS:  0000000000000000(0000) GS:ffff880333c00000(0000) knlGS:0000000000000000
>> >> May 13 13:06:25 poc2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> >> May 13 13:06:25 poc2 kernel: CR2: ffffffffffffffa4 CR3: 0000000001c0c000 CR4: 00000000000007f0
>> >> May 13 13:06:25 poc2 kernel: Stack:
>> >> May 13 13:06:25 poc2 kernel: ffffffffffffffa4 ffff88031bb79d18 ffffffffa0594d2b ffff880328d13410
>> >> May 13 13:06:25 poc2 kernel: ffff88031a37c200 ffff88031bb79d58 ffffffffa05356f2 0000000000000018
>> >> May 13 13:06:25 poc2 kernel: ffff88062cea8800 0000000000000000 ffffea000c8eb640 0000000000000000
>> >> May 13 13:06:25 poc2 kernel: Call Trace:
>> >> May 13 13:06:25 poc2 kernel: [<ffffffffa0594d2b>] fc_seq_start_next+0x1b/0x40 [libfc]
>> >> May 13 13:06:25 poc2 kernel: [<ffffffffa05356f2>] ft_queue_status+0xf2/0x220 [tcm_fc]
>> >> May 13 13:06:25 poc2 kernel: [<ffffffffa0536972>] ft_queue_data_in+0x72/0x5a0 [tcm_fc]
>> >> May 13 13:06:25 poc2 kernel: [<ffffffffa04f57ba>] target_complete_ok_work+0x14a/0x2b0 [target_core_mod]
>> >> May 13 13:06:25 poc2 kernel: [<ffffffff810810f5>] process_one_work+0x175/0x430
>> >> May 13 13:06:25 poc2 kernel: [<ffffffff81081d1b>] worker_thread+0x11b/0x3a0
>> >> May 13 13:06:25 poc2 kernel: [<ffffffff81081c00>] ? rescuer_thread+0x340/0x340
>> >> May 13 13:06:25 poc2 kernel: [<ffffffff81088660>] kthread+0xc0/0xd0
>> >> May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
>> >> May 13 13:06:25 poc2 kernel: [<ffffffff8165332c>] ret_from_fork+0x7c/0xb0
>> >> May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
>> >> May 13 13:06:25 poc2 kernel: Code: 1f 44 00 00 f3 90 0f b6 07 38 d0 75
>> >> f7 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 7e
>> >> 05 a2 ff b8 00 01 00 00 <f0> 66 0f c1 03 0f b6 d4 38 c2 74 0e 0f 1f 44
>> >> 00 00 f3 90 0f b6
>> >> May 13 13:06:25 poc2 kernel: RIP  [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40
>> >>
>> >
>> > So before we start debugging again, please confirm that this is a
>> > *completely* stock v3.11.10 build, and that your not building
>> > out-of-tree target modules again.
>>
>> Yes we installed official Fedora 20 iso image and then
>> yum install targetcli
>> modprobe fcoe
>>
>> Kernel version is 3.11.10-301.fc20.x86_64 which is a stock kernel distribution.
>>
>
> So I'd still recommend >= v3.13.y in order to pick-up the other libfc
> related bugfixes that have gone in since v3.11.y ran out of stable
> support some months ago.  I can't tell if that is related to the bug(s)
> that your hitting on both initiator + target sides here, but it
> definitely would not hurt.
>
> Also, here is the updated >= v3.13.y percpu-ida pre-allocation
> regression fix that will be going into v3.15-rc6 shortly.  Note this is
> different from the original version, so please make sure you use this
> one instead.
>
> https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?id=e7c36ac16b91bf09e94f82d1838f0f9f114bed08
>
> --nab
>
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux