Fwd: Red Hat (Fedora) bug report 1467674 concerning your kernel functional performance enhancements causing PCI Express crashes,

[+cc linux-pci]

Thanks very much for the detailed problem report, Wim!  I'm taking the
liberty to forward to the linux-pci list in case others trip over the
same thing.


---------- Forwarded message ----------
From: Wim ten Have <wim.ten.have@xxxxxxxxxx>
Date: Tue, Jul 4, 2017 at 9:13 AM
Subject: Red Hat (Fedora) bug report 1467674 concerning your kernel
functional performance enhancements causing PCI Express crashes,
To: Sinan Kaya <okaya@xxxxxxxxxxxxxx>, Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Cc: Wim ten Have <wim.ten.have@xxxxxxxxxx>


        Howdy,

I created Red Hat (Fedora) bug report 1467674.
        https://bugzilla.redhat.com/show_bug.cgi?id=1467674

This may be of interest to you, given that you were involved in creating
the code that is causing a kernel oops/malfunction, namely:

commit 60db3a4d8cc9073cf56264785197ba75ee1caca4
Author: Sinan Kaya <okaya@xxxxxxxxxxxxxx>
Date:   Fri Jan 20 09:16:51 2017 -0500

    PCI: Enable PCIe Extended Tags if supported

    Every PCIe device can generate 5-bit transaction Tags, which allow up to 32
    concurrent requests.  Some devices can generate 8-bit Extended Tags, which
    allow up to 256 concurrent requests.

    Per the ECN mentioned below, all PCIe Receivers are expected to support
    Extended Tags, so devices are allowed (but not required) to enable them by
    default.

    If a device supports Extended Tags but does not enable them by default,
    enable them.  This allows the device to have up to 256 outstanding
    transactions at a time, which may improve performance.

    [bhelgaas: changelog, check for PCIe device]
    Link: https://pcisig.com/sites/default/files/specification_documents/ECN_Extended_Tag_Enable_Default_05Sept2008_final.pdf
    Signed-off-by: Sinan Kaya <okaya@xxxxxxxxxxxxxx>
    Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
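
The arithmetic and the enable condition described in the changelog above
can be sketched in C. This is only an illustration under stated
assumptions, not the kernel's implementation: the helper names
max_outstanding_requests() and should_enable_ext_tags() are hypothetical,
and the two bit masks are the values used in Linux's
include/uapi/linux/pci_regs.h for the PCIe Device Capabilities and Device
Control registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit masks as named in Linux's include/uapi/linux/pci_regs.h. */
#define PCI_EXP_DEVCAP_EXT_TAG 0x00000020 /* DevCap: Extended Tag Field Supported */
#define PCI_EXP_DEVCTL_EXT_TAG 0x0100     /* DevCtl: Extended Tag Field Enable */

/* Hypothetical helper: number of distinct tags for a given tag width. */
unsigned int max_outstanding_requests(unsigned int tag_bits)
{
    assert(tag_bits < 32); /* keep the shift well defined */
    return 1u << tag_bits; /* 5 bits -> 32, 8 bits -> 256 */
}

/*
 * Hypothetical helper: true when the device advertises Extended Tags in
 * Device Capabilities but has not enabled them in Device Control, which
 * is the case the commit above acts on.
 */
bool should_enable_ext_tags(uint32_t devcap, uint16_t devctl)
{
    if (!(devcap & PCI_EXP_DEVCAP_EXT_TAG))
        return false; /* not supported: leave the device alone */
    return !(devctl & PCI_EXP_DEVCTL_EXT_TAG);
}
```

Note that the check only looks at the requester itself; it does not
examine whether the components upstream of the device cope with 8-bit
tags.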


>>>>>>>>>>>> REPORT <<<<<<<<<<<<<<<<
Wim ten Have 2017-07-04 10:06:19 EDT

Description of problem:
=======================
Systems with an "eth0: Tigon3 [partno(BCM95721) rev 4201] (PCI Express)"
Ethernet controller, such as the Dell PowerEdge SC1435, lose Ethernet
connectivity after the interface is bound and brought up (ifconfig up).
  [    0.000000] SMBIOS 2.4 present.
  [    0.000000] DMI: Dell Inc. PowerEdge SC1435/0H313M, BIOS 2.2.5 03/21/2008


The problem is not specific to this piece of h/w.  I pinpointed the
issue to kernel commit
60db3a4d8cc9073cf56264785197ba75ee1caca4:
  * <wtenhave@hagen:55> git bisect good
    60db3a4d8cc9073cf56264785197ba75ee1caca4 is the first bad commit
    commit 60db3a4d8cc9073cf56264785197ba75ee1caca4
    Author: Sinan Kaya <okaya@xxxxxxxxxxxxxx>
    Date:   Fri Jan 20 09:16:51 2017 -0500

      PCI: Enable PCIe Extended Tags if supported

Shortly after reaching the ifconfig up step, the system reports the
kernel messages below.
  Jul  4 15:00:12 hagen kernel: tg3 0000:01:00.0 eth0: Tigon3
[partno(BCM95721) rev 4201] (PCI Express) MAC address
00:22:19:27:cd:f8
  Jul  4 15:00:12 hagen kernel: tg3 0000:01:00.0 eth0: attached PHY is
5750 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
  Jul  4 15:00:12 hagen kernel: tg3 0000:01:00.0 eth0: RXcsums[1]
LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
  Jul  4 15:00:12 hagen kernel: tg3 0000:01:00.0 eth0:
dma_rwctrl[76180000] dma_mask[64-bit]
  Jul  4 15:00:12 hagen kernel: tg3 0000:02:00.0 eth1: Tigon3
[partno(BCM95721) rev 4201] (PCI Express) MAC address
00:22:19:27:cd:f9
  Jul  4 15:00:12 hagen kernel: tg3 0000:02:00.0 eth1: attached PHY is
5750 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
  Jul  4 15:00:12 hagen kernel: tg3 0000:02:00.0 eth1: RXcsums[1]
LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
  Jul  4 15:00:12 hagen kernel: tg3 0000:02:00.0 eth1:
dma_rwctrl[76180000] dma_mask[64-bit]
     ...
  Jul  4 15:00:12 hagen kernel: tg3 0000:02:00.0 enp2s0: renamed from eth1
     ...
  Jul  4 15:00:39 hagen kernel: tg3 0000:01:00.0 enp1s0: Link is up at
1000 Mbps, full duplex
  Jul  4 15:00:39 hagen kernel: tg3 0000:01:00.0 enp1s0: Flow control
is on for TX and on for RX
     ...
  Jul  4 15:00:50 hagen kernel: NETDEV WATCHDOG: enp1s0 (tg3):
transmit queue 0 timed out
  Jul  4 15:00:50 hagen kernel: ------------[ cut here ]------------
  Jul  4 15:00:50 hagen kernel: WARNING: CPU: 6 PID: 0 at
net/sched/sch_generic.c:316 dev_watchdog+0x215/0x220
  Jul  4 15:00:50 hagen kernel: Modules linked in: ip6t_rpfilter
ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
ebtable_broute bridge stp llc ip6table_raw ip6table_security
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
ip6table_mangle iptable_raw iptable_security iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
iptable_mangle ebtable_filter ebtables ip6table_filter ip6_tables
sunrpc xfs libcrc32c amd64_edac_mod edac_mce_amd kvm_amd kvm dcdbas
irqbypass ipmi_ssif acpi_cpufreq shpchp ipmi_si ipmi_devintf tpm_tis
pcspkr tpm_tis_core k10temp i2c_piix4 ipmi_msghandler tpm
target_core_mod amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper
ttm drm tg3 ptp ata_generic serio_raw pata_acpi pps_core
pata_serverworks sata_svw
  Jul  4 15:00:50 hagen kernel: CPU: 6 PID: 0 Comm: swapper/6 Not
tainted 4.12.0broken+ #16
  Jul  4 15:00:50 hagen kernel: Hardware name: Dell Inc. PowerEdge
SC1435/0H313M, BIOS 2.2.5 03/21/2008
  Jul  4 15:00:50 hagen kernel: task: ffff90a9ede5c980 task.stack:
ffffb123031b4000
  Jul  4 15:00:50 hagen kernel: RIP: 0010:dev_watchdog+0x215/0x220
  Jul  4 15:00:50 hagen kernel: RSP: 0018:ffff90a9efd83e60 EFLAGS: 00010286
  Jul  4 15:00:50 hagen kernel: RAX: 0000000000000039 RBX:
0000000000000000 RCX: 0000000000000000
  Jul  4 15:00:50 hagen kernel: RDX: 0000000000000000 RSI:
00000000000000f6 RDI: 0000000000000300
  Jul  4 15:00:50 hagen kernel: RBP: ffff90a9efd83e80 R08:
0000000000000000 R09: 0000000000000346
  Jul  4 15:00:50 hagen kernel: R10: ffff90a9efd92430 R11:
000000000000000f R12: ffff90a9ece46000
  Jul  4 15:00:50 hagen kernel: R13: 0000000000000006 R14:
0000000000000005 R15: ffff90a9ece46000
  Jul  4 15:00:50 hagen kernel: FS:  0000000000000000(0000)
GS:ffff90a9efd80000(0000) knlGS:0000000000000000
  Jul  4 15:00:50 hagen kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
  Jul  4 15:00:50 hagen kernel: CR2: 000055d4fa41f038 CR3:
00000004c3e09000 CR4: 00000000000006e0
  Jul  4 15:00:50 hagen kernel: Call Trace:
  Jul  4 15:00:50 hagen kernel: <IRQ>
  Jul  4 15:00:50 hagen kernel: ? qdisc_rcu_free+0x50/0x50
  Jul  4 15:00:50 hagen kernel: call_timer_fn+0x35/0x130
  Jul  4 15:00:50 hagen kernel: run_timer_softirq+0x1d1/0x420
  Jul  4 15:00:50 hagen kernel: ? sched_clock+0x9/0x10
  Jul  4 15:00:50 hagen kernel: ? sched_clock+0x9/0x10
  Jul  4 15:00:50 hagen kernel: ? sched_clock_cpu+0x11/0xb0
  Jul  4 15:00:50 hagen kernel: __do_softirq+0x10c/0x2a5
  Jul  4 15:00:50 hagen kernel: irq_exit+0xff/0x110
  Jul  4 15:00:50 hagen kernel: smp_apic_timer_interrupt+0x3d/0x50
  Jul  4 15:00:50 hagen kernel: apic_timer_interrupt+0x93/0xa0
  Jul  4 15:00:50 hagen kernel: RIP: 0010:native_safe_halt+0x6/0x10
  Jul  4 15:00:50 hagen kernel: RSP: 0018:ffffb123031b7e60 EFLAGS:
00000246 ORIG_RAX: ffffffffffffff10
  Jul  4 15:00:50 hagen kernel: RAX: 6874754100002548 RBX:
ffff90a9ede5c980 RCX: 0000000000000000
  Jul  4 15:00:50 hagen kernel: RDX: 0000000000000000 RSI:
0000000000000000 RDI: 0000000000000000
  Jul  4 15:00:50 hagen kernel: RBP: ffffb123031b7e60 R08:
00000009c80ff365 R09: ffffb1230837fa38
  Jul  4 15:00:50 hagen kernel: R10: 0000000000000000 R11:
00000000fffc0fe9 R12: 0000000000000006
  Jul  4 15:00:50 hagen kernel: R13: ffff90a9ede5c980 R14:
0000000000000000 R15: 0000000000000000
  Jul  4 15:00:50 hagen kernel: </IRQ>
  Jul  4 15:00:50 hagen kernel: default_idle+0x20/0x100
  Jul  4 15:00:50 hagen kernel: amd_e400_idle+0x3f/0x50
  Jul  4 15:00:50 hagen kernel: arch_cpu_idle+0xf/0x20
  Jul  4 15:00:50 hagen kernel: default_idle_call+0x23/0x30
  Jul  4 15:00:50 hagen kernel: do_idle+0x174/0x1e0
  Jul  4 15:00:50 hagen kernel: cpu_startup_entry+0x71/0x80
  Jul  4 15:00:50 hagen kernel: start_secondary+0x154/0x190
  Jul  4 15:00:50 hagen kernel: secondary_startup_64+0x9f/0x9f
  Jul  4 15:00:50 hagen kernel: Code: 8c 24 64 04 00 00 eb 8f 4c 89 e7
c6 05 ef 0e 88 00 01 e8 4f 6b fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7
60 a0 d1 86 e8 12 cc a5 ff <0f> ff eb c1 0f 1f 80 00 00 00 00 0f 1f 44
00 00 48 c7 47 08 00
  Jul  4 15:00:50 hagen kernel: ---[ end trace 6fdc4540cb931145 ]---
  Jul  4 15:00:50 hagen kernel: tg3 0000:01:00.0 enp1s0: transmit
timed out, resetting
  Jul  4 15:00:52 hagen abrt-dump-journal-oops:
abrt-dump-journal-oops: Found oopses: 1
  Jul  4 15:00:52 hagen abrt-dump-journal-oops:
abrt-dump-journal-oops: Creating problem directories
  Jul  4 15:00:52 hagen kernel: tg3 0000:01:00.0 enp1s0: 0x00000000:
0x165914e4, 0x00100406, 0x02000021, 0x00000010
  Jul  4 15:00:52 hagen kernel: tg3 0000:01:00.0 enp1s0: 0x00000010:
0xefef0004, 0x00000000, 0x00000000, 0x00000000
    ...
  Jul  4 15:00:53 hagen kernel: tg3 0000:01:00.0 enp1s0: 0x00007810:
0x00000000, 0x00000060, 0x00000000, 0x00000000
  Jul  4 15:00:53 hagen kernel: tg3 0000:01:00.0 enp1s0: 0: Host
status block [00000001:00000014:(0000:0005:0000):(0005:000c)]
  Jul  4 15:00:53 hagen kernel: tg3 0000:01:00.0 enp1s0: 0: NAPI info
[00000014:00000014:(0010:000c:01ff):0005:(00cd:0000:0000:0000)]
  Jul  4 15:00:53 hagen kernel: tg3 0000:01:00.0: tg3_stop_block timed
out, ofs=4800 enable_bit=2
  Jul  4 15:00:53 hagen kernel: tg3 0000:01:00.0 enp1s0: Link is down
  Jul  4 15:00:53 hagen abrt-dump-journal-oops: Reported 1 kernel oopses to Abrt
  Jul  4 15:00:53 hagen abrt-server: Deleting problem directory
oops-2017-07-04-15:00:52-889-0 (dup of oops-2017-07-03-12:49:03-930-0)


Version-Release number of selected component (if applicable):
=============================================================
The issue was first noticed under Fedora 25 when updating the kernel
from version 4.10.x to 4.11.0.

Since I needed to move forward with the latest available kernel,
I decided to hunt down the bug and report it.
I found it to be present in all (Linux) kernels starting from commit
60db3a4d8cc9073cf56264785197ba75ee1caca4

* <wtenhave@hagen:55> git bisect good
  60db3a4d8cc9073cf56264785197ba75ee1caca4 is the first bad commit
  commit 60db3a4d8cc9073cf56264785197ba75ee1caca4
  Author: Sinan Kaya <okaya@xxxxxxxxxxxxxx>
  Date:   Fri Jan 20 09:16:51 2017 -0500

    PCI: Enable PCIe Extended Tags if supported



How reproducible:
=================
It is 100% reproducible.  In fact, you can take the latest kernel
available today and back out the change made in:
  commit 60db3a4d8cc9073cf56264785197ba75ee1caca4

  <wtenhave@hagen:55> git log 60db3a4d8cc9073cf56264785197ba75ee1caca4
  commit 60db3a4d8cc9073cf56264785197ba75ee1caca4
  Author: Sinan Kaya <okaya@xxxxxxxxxxxxxx>
  Date:   Fri Jan 20 09:16:51 2017 -0500

    PCI: Enable PCIe Extended Tags if supported

    Every PCIe device can generate 5-bit transaction Tags, which allow up to 32
    concurrent requests.  Some devices can generate 8-bit Extended Tags, which
    allow up to 256 concurrent requests.

    Per the ECN mentioned below, all PCIe Receivers are expected to support
    Extended Tags, so devices are allowed (but not required) to enable them by
    default.

    If a device supports Extended Tags but does not enable them by default,
    enable them.  This allows the device to have up to 256 outstanding
    transactions at a time, which may improve performance.

    [bhelgaas: changelog, check for PCIe device]
    Link: https://pcisig.com/sites/default/files/specification_documents/ECN_Extended_Tag_Enable_Default_05Sept2008_final.pdf
    Signed-off-by: Sinan Kaya <okaya@xxxxxxxxxxxxxx>
    Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>


To back out the change, take any kernel later than 4.11 and apply the
code change below.  Then build and install that kernel.

  <wtenhave@hagen:58> git diff
  diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
  index 19c8950..1005e9d 100644
  --- a/drivers/pci/probe.c
  +++ b/drivers/pci/probe.c
  @@ -1707,7 +1707,7 @@ static void pci_configure_device(struct pci_dev *dev)
          int ret;

          pci_configure_mps(dev);
  -       pci_configure_extended_tags(dev);
  +       // pci_configure_extended_tags(dev);

          memset(&hpp, 0, sizeof(hpp));
          ret = pci_get_hp_params(dev, &hpp);
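
To check whether Extended Tags are actually enabled on a device before
and after applying the change above, one can look at the Extended Tag
Field Enable bit in the PCIe Device Control register. The following is a
minimal sketch in C, assuming the device's configuration space has
already been read into a byte buffer (for example, as root, from the
device's sysfs "config" file). The function names rd16(),
find_pcie_cap() and ext_tag_enabled() are hypothetical; the offsets and
bit values follow the standard PCI/PCIe register layout as named in
Linux's include/uapi/linux/pci_regs.h.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS             0x06   /* Status register */
#define PCI_STATUS_CAP_LIST    0x10   /* device has a capability list */
#define PCI_CAPABILITY_LIST    0x34   /* pointer to first capability */
#define PCI_CAP_ID_EXP         0x10   /* PCI Express capability ID */
#define PCI_EXP_DEVCTL         8      /* Device Control, relative to cap */
#define PCI_EXP_DEVCTL_EXT_TAG 0x0100 /* Extended Tag Field Enable */

/* Read a little-endian 16-bit value from the config space buffer. */
static uint16_t rd16(const uint8_t *cfg, size_t off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Walk the capability list; return the PCIe capability offset or 0. */
static size_t find_pcie_cap(const uint8_t *cfg, size_t len)
{
    if (len < 0x40 || !(rd16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;

    size_t pos = cfg[PCI_CAPABILITY_LIST] & ~3u;
    /* Bound the walk so a corrupt, looping list cannot spin forever. */
    for (int ttl = 48; ttl > 0 && pos >= 0x40 && pos + 1 < len; ttl--) {
        if (cfg[pos] == PCI_CAP_ID_EXP)
            return pos;
        pos = cfg[pos + 1] & ~3u; /* next-capability pointer */
    }
    return 0;
}

/* Returns 1 if enabled, 0 if disabled, -1 if no PCIe capability found. */
int ext_tag_enabled(const uint8_t *cfg, size_t len)
{
    size_t cap = find_pcie_cap(cfg, len);

    if (cap == 0 || cap + PCI_EXP_DEVCTL + 2 > len)
        return -1;
    return (rd16(cfg, cap + PCI_EXP_DEVCTL) & PCI_EXP_DEVCTL_EXT_TAG) ? 1 : 0;
}
```

With the pci_configure_extended_tags() call commented out as in the diff,
the expectation would be that a device which does not enable Extended
Tags on its own reports 0 here.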


Steps to Reproduce:
===================
1. Take a machine with the appropriate h/w, such as a Dell Inc. PowerEdge
SC1435/0H313M, BIOS 2.2.5 03/21/2008, with a Tigon3 [partno(BCM95721) rev
4201] (PCI Express) controller.
2. Install Fedora 25 (or another distribution) with a kernel that includes
the commit, such as kernel-4.11.7-200.fc25.x86_64.
3. Boot and watch it crash as soon as it starts to operate the PCI
Express Ethernet controller.

Actual results:

Expected results:

Additional info:
================
If you need further input, please drop me a line at:
"Wim ten Have <wim.ten.have@xxxxxxxxxx>"

Regards,
- Wim.


