Re: [PATCH v7] ACPI / APEI: fix the regression of synchronous external aborts occur in user-mode

"Rafael J. Wysocki" <rafael@xxxxxxxxxx> · Thu, 17 Jun 2021 14:07:01 +0200

On Tue, Jun 15, 2021 at 5:47 AM Xiaofei Tan <tanxiaofei@xxxxxxxxxx> wrote:
>
> Hi Rafael,
>
> On 2021/6/14 23:46, Rafael J. Wysocki wrote:
> > On Fri, Jun 11, 2021 at 2:40 PM Xiaofei Tan <tanxiaofei@xxxxxxxxxx> wrote:
> >>
> >> Before commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea()
> >> synchronise with APEI's irq work"), do_sea() would unconditionally
> >> signal the affected task from the arch code. Since that change,
> >> the GHES driver sends the signals.
> >>
> >> This exposes a problem: errors that the GHES driver doesn't understand
> >> or doesn't handle effectively are silently ignored. As a result, the
> >> same error is taken again and again, and the user-space task gets
> >> stuck in this loop.
> >>
> >> Existing firmware on Kunpeng9xx systems reports cache errors with the
> >> 'ARM Processor Error' CPER records.
> >>
> >> Do memory failure handling for the ARM Processor Error Section, just
> >> as is done for the Memory Error Section.
> >
> > Still, I'm not convinced that this is the right way to address the problem.
> >
> > In particular, is it guaranteed that "ARM Processor Error" will always
> > mean "memory failure" on all platforms?
> >
>
> There are two sources of ARM Processor cache errors (the second case does not apply to platforms that don't support a poison mechanism).
> 1. The error occurs in the cache itself. If it is transient, we have a chance to recover by doing memory failure handling.
> If it is persistent, it has to be handled elsewhere, for example by cache way isolation in firmware
> or by triggering CPU core isolation from user space. I think most platforms can't support such features,
> so the simplest and most effective approach is to report a fatal error and do the isolation during the firmware start-up phase.
>
> 2. The error is transferred from another RAS node. If it comes from DDR, I think there is no doubt, and this is
> the most common case we have met so far. If it comes from elsewhere in the SoC, such as internal SRAM (far less likely than DDR),
> the error is still in the hardware, but the RAS node that detected the SRAM error will also report it.
>
> To sum up, this approach is effective in most situations and does no harm in the others.
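
As a rough illustration of the approach discussed above (not the patch itself), the handling can be modeled in plain user-space C: walk the error entries of an "ARM Processor Error" record and hand an entry to memory-failure handling only when it is a cache error that carries a valid physical address. Every name below (struct arm_err_info, queue_memory_failure(), handle_arm_processor_error()) is hypothetical and simplified, the 4 KiB page size is an assumption, and restricting the handling to recoverable records is also an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one error entry of a record. */
enum err_type { ERR_CACHE, ERR_TLB, ERR_BUS, ERR_MICROARCH };

struct arm_err_info {
    enum err_type type;
    bool has_phys_addr;   /* true if phys_addr holds a valid fault address */
    uint64_t phys_addr;
};

/* Hypothetical sink standing in for memory-failure handling. */
#define MAX_QUEUED 16
static uint64_t queued_pfns[MAX_QUEUED];
static size_t queued_count;

static bool queue_memory_failure(uint64_t phys_addr)
{
    if (queued_count >= MAX_QUEUED)
        return false;
    /* Assumption: 4 KiB pages, so the page frame number is addr >> 12. */
    queued_pfns[queued_count++] = phys_addr >> 12;
    return true;
}

/*
 * Returns true if at least one entry was handed to memory-failure
 * handling. Entries that are not cache errors, or that lack a valid
 * physical address, are skipped. Non-recoverable records are never
 * handed over (an assumption of this sketch).
 */
static bool handle_arm_processor_error(const struct arm_err_info *info,
                                       size_t n, bool recoverable)
{
    bool queued = false;

    assert(info != NULL || n == 0);

    if (!recoverable)
        return false;

    for (size_t i = 0; i < n; i++) {
        if (info[i].type != ERR_CACHE || !info[i].has_phys_addr)
            continue;
        if (queue_memory_failure(info[i].phys_addr))
            queued = true;
    }
    return queued;
}
```

In this model, a record that yields no queued page would still need a fallback (for example, signalling the task), since otherwise the same error would simply be taken again.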

OK, so applied as 5.14 material under an edited subject.

Thanks!