Re: Kernel oops while using AER inject

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Dec 03, 2020 at 04:00:04PM +0000, Hinko Kocevar wrote:
> Note that I had to manually remove following devices to make the recovery report success:
> 
> echo 1 > /sys/bus/pci/devices/0000\:02\:02.0/remove 
> echo 1 > /sys/bus/pci/devices/0000\:02\:08.0/remove 
> echo 1 > /sys/bus/pci/devices/0000\:02\:09.0/remove 
> echo 1 > /sys/bus/pci/devices/0000\:02\:0a.0/remove 
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.1/remove 
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.2/remove 
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.3/remove 
> echo 1 > /sys/bus/pci/devices/0000\:01\:00.4/remove 
> 
> After that, the PCI device tree looks like this:
> 
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>  └─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>     └─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>        └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>           ├─04:00.0 downstream_port, slot 10, power: Off
>           ├─04:01.0 downstream_port, slot 4, device present, power: Off, speed 8GT/s, width x4
>           │  └─06:00.0 endpoint, Research Centre Juelich (1796), device 0024
>           ├─04:02.0 downstream_port, slot 9, power: Off
>           ├─04:03.0 downstream_port, slot 3, device present, power: Off, speed 8GT/s, width x4
>           │  └─08:00.0 endpoint, Research Centre Juelich (1796), device 0024
>           ├─04:08.0 downstream_port, slot 5, device present, power: Off, speed 2.5GT/s, width x4
>           │  └─09:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
>           ├─04:09.0 downstream_port, slot 11, power: Off
>           ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 2.5GT/s, width x4
>           │  └─0b:00.0 endpoint, current speed 2.5GT/s target speed 8GT/s, Research Centre Juelich (1796), device 0024
>           ├─04:0b.0 downstream_port, slot 12, power: Off
>           ├─04:10.0 downstream_port, slot 8, power: Off
>           ├─04:11.0 downstream_port, slot 2, device present, power: Off, speed 2.5GT/s, width x1
>           │  └─0e:00.0 endpoint, Xilinx Corporation (10ee), device 7011
>           └─04:12.0 downstream_port, slot 7, power: Off
> 00:01.1 root_port, slot 2, device present
> 00:1c.0 root_port, slot 4, device present, speed 2.5GT/s, width x1
>  └─15:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
> 00:1c.1 root_port, slot 5, device present, speed 2.5GT/s, width x1
>  └─16:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
> 00:1c.2 root_port, slot 6, device present, speed 2.5GT/s, width x1
>  └─17:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Backplane Connection (1537)
> 00:1c.3 root_port, "J6B1", slot 7, device present, speed 2.5GT/s, width x1
>  └─18:00.0 endpoint, Intel Corporation (8086) I210 Gigabit Network Connection (1533)
> 
> Finally, here is the result of the AER recovery taking place upon injecting the fatal uncorrectable error into the 00:01.0 slot.
> 
> Dec  3 15:25:30 bd-cpu18 kernel: aer 0000:00:01.0:pcie002: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0008
> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Requester ID)
> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
> Dec  3 15:25:30 bd-cpu18 kernel: pcieport 0000:00:01.0:    [14] Completion Timeout    
> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:30 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:30 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_error_detected] called .. state=2
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_result_none] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:06:00.0: [aer_resume] UC errors cleared!
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:08:00.0: [aer_resume] UC errors cleared!
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:09:00.0: [aer_resume] UC errors cleared!
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: sis8300 0000:0b:00.0: [aer_resume] UC errors cleared!
> Dec  3 15:25:31 bd-cpu18 kernel: mrf-pci 0000:0e:00.0: [aer_resume] called..
> Dec  3 15:25:31 bd-cpu18 kernel: pcieport 0000:00:01.0: AER: Device recovery successful
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
> Dec  3 15:25:33 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_open] available = 1
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:08:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_open] available = 1
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:06:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_open] available = 1
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:09:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_open] available = 1
> Dec  3 15:25:34 bd-cpu18 kernel: sis8300 0000:0b:00.0: [sis8300_release] available = 1, count = 0
> Dec  3 15:25:47 bd-cpu18 dbus[1266]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
> Dec  3 15:25:47 bd-cpu18 dbus[1266]: [system] Successfully activated service 'org.freedesktop.problems'
> 
> At this point the PCI space reads for the AMCs returns 0xFFFFFFFF.
> 
> Below are messages captured after issuing 'echo 1 > /sys/bus/pci/rescan'. After that the CPU card rebooted by itself.
> 
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 1024)
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.2: Max Payload Size set to 256 (was 128, max 1024)
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.3: Max Payload Size set to 256 (was 128, max 1024)
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:01:00.4: Max Payload Size set to 256 (was 128, max 1024)
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:02:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pci 0000:02:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:03:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:08.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:09.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0a.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:0b.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:10.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:11.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: pcieport 0000:04:12.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> Dec  3 15:25:59 bd-cpu18 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
> Dec  3 15:25:59 bd-cpu18 kernel: IP: [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
> Dec  3 15:25:59 bd-cpu18 kernel: PGD 80000001767c4067 PUD 431939067 PMD 0 
> Dec  3 15:25:59 bd-cpu18 kernel: Oops: 0000 [#1] SMP 
> Dec  3 15:25:59 bd-cpu18 kernel: Modules linked in: aer_inject sis8300drv(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mei_wdt intel_wmi_thunderbolt intel_pmc_core snd_hda_codec_hdmi intel_powerclamp coretemp intel_rapl kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core irqbypass crc32_pclmul snd_hwdep ghash_clmulni_intel snd_seq snd_seq_device aesni_intel lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr snd_timer snd i2c_i801 soundcore sg i2c_designware_platform i2c_designware_core mei_me mei ie31200_edac
> Dec  3 15:25:59 bd-cpu18 kernel: wmi pinctrl_sunrisepoint pinctrl_intel tpm_crb acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic i915 iosf_mbi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul igb libahci crct10dif_common crc32c_intel drm libata serio_raw ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks mrf(OE) parport uio i2c_hid video [last unloaded: sis8300drv]
> Dec  3 15:25:59 bd-cpu18 kernel: CPU: 4 PID: 8150 Comm: bash Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.6.1.el7.x86_64.debug #1
> Dec  3 15:25:59 bd-cpu18 kernel: Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
> Dec  3 15:25:59 bd-cpu18 kernel: task: ffff9b22b5f88000 ti: ffff9b22b5bc4000 task.ti: ffff9b22b5bc4000
> Dec  3 15:25:59 bd-cpu18 kernel: RIP: 0010:[<ffffffffc0b182f7>]  [<ffffffffc0b182f7>] aer_inj_read_config+0x87/0x160 [aer_inject]
> Dec  3 15:25:59 bd-cpu18 kernel: RSP: 0018:ffff9b22b5bc7a30  EFLAGS: 00010046
> Dec  3 15:25:59 bd-cpu18 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> Dec  3 15:25:59 bd-cpu18 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b25718ab000
> Dec  3 15:25:59 bd-cpu18 kernel: RBP: ffff9b22b5bc7a68 R08: ffff9b22b5bc7a84 R09: ffffffffc0b1a0b0
> Dec  3 15:25:59 bd-cpu18 kernel: R10: 0000000000000001 R11: 69c1fefdd26da26d R12: 0000000000000000
> Dec  3 15:25:59 bd-cpu18 kernel: R13: ffffffffc0b1a050 R14: ffff9b22b5bc7a84 R15: ffff9b25718ab000
> Dec  3 15:25:59 bd-cpu18 kernel: FS:  00007f3d00e8b740(0000) GS:ffff9b259ce00000(0000) knlGS:0000000000000000
> Dec  3 15:25:59 bd-cpu18 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Dec  3 15:25:59 bd-cpu18 kernel: CR2: 0000000000000000 CR3: 0000000175796000 CR4: 00000000003607e0
> Dec  3 15:25:59 bd-cpu18 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Dec  3 15:25:59 bd-cpu18 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Dec  3 15:25:59 bd-cpu18 kernel: Call Trace:
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c61a9c>] pci_bus_read_config_dword+0x8c/0xb0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c63a68>] pci_bus_read_dev_vendor_id+0x28/0xe0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] pci_scan_single_device+0x74/0xf0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65eba>] pci_scan_slot+0x3a/0x140
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c6711d>] pci_scan_child_bus_extend+0x4d/0x2f0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c62bcb>] ? pci_bus_write_config_dword+0x6b/0x80
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f4b>] pci_scan_bridge_extend+0x47b/0x5e0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c67286>] pci_scan_child_bus_extend+0x1b6/0x2f0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c673d0>] pci_scan_child_bus+0x10/0x20
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c66f01>] pci_scan_bridge_extend+0x431/0x5e0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c65e04>] ? pci_scan_single_device+0x74/0xf0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c671e7>] pci_scan_child_bus_extend+0x117/0x2f0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c674b6>] pci_rescan_bus+0x16/0x40
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2c70d78>] bus_rescan_store+0x78/0xa0
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2d56909>] bus_attr_store+0x29/0x30
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b5272a>] sysfs_kf_write+0x4a/0x60
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2b51ff6>] kernfs_fop_write+0x106/0x190
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2aaf3fc>] vfs_write+0xdc/0x240
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ad5984>] ? fget_light+0x3c4/0x550
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa2ab029a>] SyS_write+0x8a/0x100
> Dec  3 15:25:59 bd-cpu18 kernel: [<ffffffffa3098b12>] system_call_fastpath+0x25/0x2a
> Dec  3 15:25:59 bd-cpu18 kernel: Code: 81 f9 b0 a0 b1 c0 74 5c 4d 3b 79 10 75 ee 49 8b 41 18 4d 8b af b8 00 00 00 89 de 49 89 87 b8 00 00 00 4d 89 f0 44 89 e2 4c 89 ff <48> 8b 00 e8 41 f3 10 e2 48 8b 75 d0 89 c3 4d 89 af b8 00 00 00 

I believe this is a known issue with aer_inject when you re-enumerate
devices below a bridge with injected errors. I reported it here:

  https://lore.kernel.org/linux-pci/20180918235848.26694-3-keith.busch@xxxxxxxxx/

Essentially, the re-enumerated devices inherit the injected bus_ops, but
aer_inject doesn't know about those devices.

The solution I proposed, however, had some CPU architectural and kernel
config limitations that made it less appealing.



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux