On 8/30/2017 11:16 AM, Borislav Petkov wrote: > On Wed, Aug 30, 2017 at 10:05:44AM -0400, Sinan Kaya wrote: >> Link reset is not the only recovery mechanism. In the case of nonfatal >> errors, it is assumed that the endpoint CSR is still reachable. >> Error is propagated the PCIe endpoint driver. Endpoint driver does a >> re-initialization, we are back in business. > > I'm assuming that's broadcast_error_message()'s job. > That's right. Each driver provides an err_handler hook. broadcast function calls these. static struct pci_driver e1000_driver = { .. .err_handler = &e1000_err_handler }; struct pci_error_handlers { ... pci_ers_result_t (*error_detected)(struct pci_dev *dev, enum pci_channel_state error); } >> That's not true. The GHES code is changing the severity here before posting >> to the AER driver in ghes_do_proc(). >> >> if (gdata->flags & CPER_SEC_RESET) >> aer_severity = AER_FATAL; > > You're missing the point that we would walk into that if branch *only* for > > if (sev == GHES_SEV_RECOVERABLE && > sec_sev == GHES_SEV_RECOVERABLE > > severities. So if you have an AER_FATAL error but ghes severities are > not GHES_SEV_RECOVERABLE, nothing happens. I see. We should probably try to do something only if GHES_SEV_CORRECTED or GHES_SEV_RECOVERABLE. If somebody wants to crash the system with GHES_SEV_PANIC, there is no point in doing additional work. > >> No, AER ISR is not set up if firmware first is enabled. > > So then this is a major suckage. We do AER recovery on FF systems only > for GHES_SEV_RECOVERABLE severity. > >> The behavior should match non firmware-first case ideally. >> >> 1. Print all correctable errors. >> 2. Go to do_recovery for all uncorrectable errors including fatal and >> non-fatal. >> >> This is also what AER driver does in the absence of firmware first via >> handle_error_source(). > > Yes, that makes sense. > > Which would mean that we'd call aer_recover_queue() regardless of GHES > severity but we'd do recovery only if GHES_SEV_RECOVERABLE is set > or CPER_SEC_RESET. I.e., we can communicate all that by setting the > correct AER severity before calling aer_recover_queue(). And then call > do_recovery() based on AER severity. > > Hmmm? > Sounds good. Do you still want to do PCIe recovery in the case of GHES_SEV_PANIC or if some FW returns GHES_SEV_NO? -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.