On 04/19/2018 02:03 PM, Borislav Petkov wrote:
> (snip useful explanation).
>
> On Thu, Apr 19, 2018 at 12:40:54PM -0500, Alex G. wrote:
>> On the r740xd, FW just hides those errors from the OS with no further
>> notification. On this machine BIOS sets things up such that non-posted
>> requests report fatal (PCIe) errors. FW still tries very hard to hide
>> this from the OS, and I think the heuristic is that if the drive
>> physical presence is gone, don't even report the error.
>
> Ok, second question: can you detect from the error signatures alone that
> it was a surprise removal?

I suppose you could make some inference, given the timing of other
events going on around the crash. It's not uncommon to see a "Card not
present" event around drive removal.

Since the presence detect pin breaks last, you might not get that
interrupt for a long while. In that case it's much harder to determine
if you're seeing a SURPRISE!!! removal or some other fault.

I don't think you can use GHES alone to determine the nature of the
event. There is not a 1:1 mapping from the set of things going wrong to
the set of PCIe errors.

> How does such an error look like, in detail?

It's green on the soft side, with lots of red accents, as well as some
textured white shades:

[ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
[ 52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able to correct
[ 52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
[ 52.703347] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
[ 52.711616] {1}[Hardware Error]: event severity: fatal
[ 52.716754] {1}[Hardware Error]: Error 0, type: fatal
[ 52.721891] {1}[Hardware Error]: section_type: PCIe error
[ 52.727463] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 52.734075] {1}[Hardware Error]: version: 3.0
[ 52.738607] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 52.744786] {1}[Hardware Error]: device_id: 0000:b0:06.0
[ 52.750271] {1}[Hardware Error]: slot: 4
[ 52.754371] {1}[Hardware Error]: secondary_bus: 0xb3
[ 52.759509] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
[ 52.766123] {1}[Hardware Error]: class_code: 000406
[ 52.771182] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ 52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask: 0x01a10000
[ 52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
[ 52.786348] pcieport 0000:b0:06.0: [20] Unsupported Request
[ 52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
[ 52.786352] pcieport 0000:b0:06.0: TLP Header: 40000001 0000020f e12023bc 01000000
[ 52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
[ 52.883895] pci 0000:b3:00.0: device has no driver
[ 52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event queued; currently getting powered on
[ 52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up

> Got error logs somewhere to dump?

Sure [1]. They have the ANSI sequences, so you might want to wget and
grep them in a color terminal.

Alex

[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180416-1919.log
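
The timing-based inference discussed in this message could be sketched
roughly as below. This is only an illustration, not code from the thread
or from the kernel: the function name, the 2-second window, and the
regexes (modelled on the dmesg lines quoted in this message) are all
assumptions. It flags a GHES PCIe error as a possible surprise removal
when pciehp logged a "Link Down" on the same port shortly before it. As
noted in the message, such a match is only a hint, not proof.

```python
import re

# dmesg prefixes every line with a "[ seconds.micros]" timestamp.
TS = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# pciehp "Link Down" event; captures the port's PCI address.
LINK_DOWN = re.compile(r"pciehp (\S+?):pcie\d+: Slot\(\d+\): Link Down\s*$")
# GHES "device_id:" field holding a PCI address (not the numeric device ID).
GHES_DEV = re.compile(
    r"\[Hardware Error\]:\s+device_id: "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)

WINDOW = 2.0  # seconds; an arbitrary assumption, tune for your platform


def likely_surprise_removals(lines, window=WINDOW):
    """Return (device, link_down_time, ghes_time) tuples for GHES errors
    that follow a Link Down on the same port within `window` seconds."""
    link_down = {}  # PCI address -> timestamps of Link Down events seen so far
    hits = []
    for line in lines:
        m = TS.match(line.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        down = LINK_DOWN.search(msg)
        if down:
            link_down.setdefault(down.group(1), []).append(ts)
            continue
        ghes = GHES_DEV.search(msg)
        if ghes:
            dev = ghes.group(1)
            for t in link_down.get(dev, []):
                if 0 <= ts - t <= window:
                    hits.append((dev, t, ts))
    return hits


if __name__ == "__main__":
    log = [
        "[ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down",
        "[ 52.703347] {1}[Hardware Error]: Hardware error from APEI "
        "Generic Hardware Error Source: 1",
        "[ 52.744786] {1}[Hardware Error]: device_id: 0000:b0:06.0",
    ]
    for dev, t_down, t_err in likely_surprise_removals(log):
        print(f"{dev}: GHES error {t_err - t_down:.3f}s after Link Down")
```

A missing match does not rule out a removal: the Link Down or presence
detect event may arrive late or not at all, which is exactly the case
where the error record alone is ambiguous.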