Hi Alexandru, On 03/04/18 18:08, Alexandru Gagniuc wrote: > BIOSes like to send NMIs for a number of silly reasons often deemed > to be "fatal". For example pin bounce during a PCIE hotplug/unplug > might cause the link to go down and retrain, with fatal PCI errors > being generated while the link is retraining. Sounds fun! > Instead of panic()ing in NMI context, pass fatal errors down to IRQ > context to see if they can be resolved. How do we know we will survive this trip? On arm64 systems it may not be possible to return to the context we took the NMI notification from: we may bounce back into firmware with the same "world is on fire" error. Firmware can rightly assume the OS has made no attempt to handle the error. Your 'not re-arming the error' example makes this worrying. > With these change, PCIe error are handled by AER. Other far less > common errors, such as machine check exceptions, still cause a panic() > in their respective handlers. I agree AER is always going to be different. Could we take a look at the CPER records while still in_nmi() to decide whether linux knows better than firmware? For non-standard or processor-errors I think we should always panic() if they're marked as fatal. For memory-errors we could split memory_failure() up to have {NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether the affected memory is kernel memory. For the PCI*/AER errors we should be able to try and handle it ... if we can get back to process/IRQ context: What happens if a PCIe driver has interrupts masked and is doing something to cause this error? We can take the NMI and setup the irq-work, but it may never run as we return to interrupts-masked context. > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index 2c998125b1d5..7243a99ea57e 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes) > static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) > { > struct ghes *ghes; > - int sev, ret = NMI_DONE; > + int ret = NMI_DONE; > > if (!atomic_add_unless(&ghes_in_nmi, 1, 1)) > return ret; > @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) > ret = NMI_HANDLED; > } > > - sev = ghes_severity(ghes->estatus->error_severity); > - if (sev >= GHES_SEV_PANIC) { > - oops_begin(); > - ghes_print_queued_estatus(); > - __ghes_panic(ghes); > - } > - > if (!(ghes->flags & GHES_TO_CLEAR)) > continue; For Processor-errors I think this is the wrong thing to do, but we should be able to poke around in the CPER records and find out what we are dealing with. Thanks, James -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html