On Tue, Jun 15, 2021 at 5:47 AM Xiaofei Tan <tanxiaofei@xxxxxxxxxx> wrote:
>
> Hi Rafael,
>
> On 2021/6/14 23:46, Rafael J. Wysocki wrote:
> > On Fri, Jun 11, 2021 at 2:40 PM Xiaofei Tan <tanxiaofei@xxxxxxxxxx> wrote:
> >>
> >> Before commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea()
> >> synchronise with APEI's irq work"), do_sea() would unconditionally
> >> signal the affected task from the arch code. Since that change,
> >> the GHES driver sends the signals.
> >>
> >> This exposes a problem as errors the GHES driver doesn't understand
> >> or doesn't handle effectively are silently ignored. It will cause
> >> the errors get taken again, and circulate endlessly. User-space task
> >> get stuck in this loop.
> >>
> >> Existing firmware on Kunpeng9xx systems reports cache errors with the
> >> 'ARM Processor Error' CPER records.
> >>
> >> Do memory failure handling for ARM Processor Error Section just like
> >> for Memory Error Section.
> >
> > Still, I'm not convinced that this is the right way to address the problem.
> >
> > In particular, is it guaranteed that "ARM Processor Error" will always
> > mean "memory failure" on all platforms?
> >
>
> There are two sources for ARM Processor cache errors(no second case for the platform that doesn't support poison mechanism).
> 1.occur in the cache. If it is transient, we have a chance to recover by doing memory failure.
> If it is persistent, we have to handle in other place, such as do cache way isolation in firmware,
> or trigger cpu core isolation in user space. I think most platform can't support such feature,
> so the most simple and effective way is report as fatal error and do isolation during firmware start-up phase.
>
> 2.error transferred from other RAS node. If it is from DDR, i think there is no doubt, and this is
> the most cases we met before.If it is from other place of SoC, such as internal SRAM(the probability is very little compare to DDR),
> the error is still in the hardware. But the RAS node that detected the SRAM error will also report the error.
>
> To sum up the above, it is effective for most situation, and no harm for the others.

OK, so applied as 5.14 material under edited subject.

Thanks!
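
For readers following the thread, the approach discussed above ("do memory failure handling for ARM Processor Error Section just like for Memory Error Section") can be illustrated with a small user-space sketch. This is not the applied patch and not kernel code: every type, enum and function name below (all prefixed sketch_) is hypothetical, and the rules that only recoverable errors are acted on and that only cache errors carrying a physical address are queued are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for CPER fields; not the kernel's definitions. */
enum sketch_err_type { SKETCH_CACHE_ERROR, SKETCH_TLB_ERROR, SKETCH_BUS_ERROR };
enum sketch_severity { SKETCH_SEV_CORRECTED, SKETCH_SEV_RECOVERABLE, SKETCH_SEV_FATAL };

struct sketch_err_info {
	enum sketch_err_type type;
	bool has_phys_addr;	/* the record carries a valid physical address */
	uint64_t phys_addr;
};

/* Hypothetical hook standing in for "queue memory-failure work for this address". */
typedef bool (*sketch_queue_fn)(uint64_t phys_addr);

/*
 * Walk the error-info entries of one ARM Processor Error record.
 * Assumption: only recoverable errors are acted on, and only cache
 * errors with a known physical address are handed to the hook.
 * Returns true if at least one address was queued.
 */
bool sketch_handle_arm_error(const struct sketch_err_info *info, size_t count,
			     enum sketch_severity sev, sketch_queue_fn queue)
{
	bool queued = false;
	size_t i;

	if (sev != SKETCH_SEV_RECOVERABLE)
		return false;

	for (i = 0; i < count; i++) {
		if (info[i].type != SKETCH_CACHE_ERROR || !info[i].has_phys_addr)
			continue;
		if (queue(info[i].phys_addr))
			queued = true;
	}
	return queued;
}

/* Example hook: remember the queued addresses in a small fixed-size table. */
uint64_t sketch_queued[8];
size_t sketch_queued_count;

bool sketch_record_addr(uint64_t phys_addr)
{
	if (sketch_queued_count >= sizeof(sketch_queued) / sizeof(sketch_queued[0]))
		return false;
	sketch_queued[sketch_queued_count++] = phys_addr;
	return true;
}
```

The return value tells the caller whether anything was actually queued, so a record that the sketch does not understand is reported as unhandled rather than silently ignored, which is the failure mode the commit message describes.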