Hi Dave,

Thanks for going through all these.

On 15/01/18 16:30, Dave Martin wrote:
> On Thu, Jan 11, 2018 at 06:59:36PM -0600, Eric W. Biederman wrote:
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 9b7f89df49db..abe200587334 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -596,7 +596,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)

>> + { do_sea, SIGBUS, BUS_FIXME, "synchronous external abort" },
>
> This si_code seems to be a fallback for if ACPI is absent or doesn't
> know what to do with this error.
>
> -> SIGBUS/BUS_OBJERR?
>
> Can probably legitimately happen for userspace for suitable MMIO mappings.

It can happen for normal memory too: there are specific ESR values for
parity/checksum errors when reading/writing memory.
I think this first one is 'other/unknown', and it's up to the CPU how to
classify them.

> Perhaps it's more serious though in the presence of ACPI. Do we expect
> that ACPI can diagnose all localisable errors?

It's not just ACPI: the CPU's v8.2 RAS Extensions use this
synchronous-external-abort as notification of a RAS error (the other
details are written to memory-mapped nodes).

With the v8.2 RAS Extensions the ESR tells us if the error was contained.

For ACPI we rely on firmware to set an appropriate severity in the CPER
records it generates. The APEI helpers will call panic() if they find a
fatal error.

For systems with neither {firmware,kernel}-first RAS, BUS_OBJERR looks like
a good choice.

>> + { do_sea, SIGBUS, BUS_FIXME, "level 0 (translation table walk)" },
>> + { do_sea, SIGBUS, BUS_FIXME, "level 1 (translation table walk)" },
>> + { do_sea, SIGBUS, BUS_FIXME, "level 2 (translation table walk)" },
>> + { do_sea, SIGBUS, BUS_FIXME, "level 3 (translation table walk)" },
>
> Pagetable screwup or kernel/system/CPU bug -> SIGKILL, or panic().

(RAS mechanisms may claim this and send their own signals; if not:)
SIGKILL is probably a better choice here. While we do have an address,
there is nothing user-space can do about it.

>> + { do_sea, SIGBUS, BUS_FIXME, "synchronous parity or ECC error" }, // Reserved when RAS is implemented
>
> Possibly SIGBUS/BUS_MCEERR_AR (though I don't know exactly what
> userspace is supposed to do with this or whether this implies the
> existence or certain kernel features for managing the error that
> may not be present on arm64...)

I'd like to keep the MCEERR signals to errors that we know are contained,
and that the kernel has understood and handled.

(These features do exist for arm64: enabling CONFIG_MEMORY_FAILURE and a
few APEI options allows all this to work today with suitable firmware. My
Seattle claims to support it.)

> Otherwise, SIGKILL.

Sounds good.

>> + { do_sea, SIGBUS, BUS_FIXME, "level 0 synchronous parity error (translation table walk)" }, // Reserved when RAS is implemented
>> + { do_sea, SIGBUS, BUS_FIXME, "level 1 synchronous parity error (translation table walk)" }, // Reserved when RAS is implemented
>> + { do_sea, SIGBUS, BUS_FIXME, "level 2 synchronous parity error (translation table walk)" }, // Reserved when RAS is implemented
>> + { do_sea, SIGBUS, BUS_FIXME, "level 3 synchronous parity error (translation table walk)" }, // Reserved when RAS is implemented
>
> Process page tables corrupt: if the kernel couldn't fix this, the
> process can't reasonably fix it -> SIGKILL
>
> Since this is a RAS-type error it could be triggered by a cosmic ray
> rather than requiring a kernel or system bug or other major failure, so
> we probably shouldn't panic the system if the error is localisable to a
> particular process.

Without the RAS-Extensions severity to tell us the error is contained I'm
not sure what we can expect. But given the page-tables are per-process, and
we never swap them to disk etc, it's probably a safe bet that it doesn't
matter either way for these.

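To summarise the fallback choices discussed above (for when no RAS mechanism
claims the error), here is a rough userspace-compilable sketch. It is not the
patch itself: the enum, struct and function names are made up for
illustration, and the si_code paired with SIGKILL is a placeholder, since
nothing above settles on one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical grouping of the fault_info[] rows quoted in this thread. */
enum sea_class {
	SEA_EXTERNAL_ABORT,	/* "synchronous external abort" */
	SEA_TTW,		/* "level N (translation table walk)" */
	SEA_PARITY_ECC,		/* "synchronous parity or ECC error" */
	SEA_TTW_PARITY,		/* "level N synchronous parity error (translation table walk)" */
	SEA_CLASS_COUNT
};

/* Hypothetical record: which signal/si_code to fall back to. */
struct sea_fallback {
	int sig;
	int code;	/* 0 below is a placeholder, not a proposal */
	const char *name;
};

static const struct sea_fallback sea_fallbacks[SEA_CLASS_COUNT] = {
	/* Neither firmware-first nor kernel-first RAS: BUS_OBJERR. */
	[SEA_EXTERNAL_ABORT] = { SIGBUS,  BUS_OBJERR, "synchronous external abort" },
	/* We have an address, but user-space can do nothing with it. */
	[SEA_TTW]            = { SIGKILL, 0, "translation table walk" },
	/* MCEERR codes are kept for contained, handled errors only. */
	[SEA_PARITY_ECC]     = { SIGKILL, 0, "synchronous parity or ECC error" },
	/* Per-process page tables are corrupt: the process cannot fix this. */
	[SEA_TTW_PARITY]     = { SIGKILL, 0, "parity error on translation table walk" },
};

/* Return the fallback for a class, or NULL if the class is out of range. */
const struct sea_fallback *sea_fallback_for(enum sea_class c)
{
	if ((int)c < 0 || c >= SEA_CLASS_COUNT)
		return NULL;
	assert(sea_fallbacks[c].name != NULL);	/* table must be fully populated */
	return &sea_fallbacks[c];
}
```

Only the first row keeps SIGBUS; the other three become SIGKILL, matching
the replies above.
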
Thanks,

James