Hi Hawa,

On 17/06/2019 14:00, Hawa, Hanna wrote:
>> I don't think it can, on a second reading, it looks to be even more complicated than I
>> thought! That bit is described as disabling forwarding of uncorrected data, but it looks
>> like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
>> means in this context.)
>> I was looking for reasons you could 'know' that any reported error was corrected. This was
>> just a bad suggestion!

> Is there interrupt for un-correctable error?

The answer here is somewhere between 'not really' and 'maybe'.
There is a signal you may have wired up as an interrupt, but it's not usable from Linux.

A.8.2 "Asynchronous error signals" of the A57 TRM [0] has:
| nINTERRIRQ output Error indicator for an L2 RAM double-bit ECC error.

("7.6 Asynchronous errors" has more on this).

Errors cause L2ECTLR[30] to get set, and this value is output as a signal; you may have
wired it up as an interrupt.

If you did, beware it's level sensitive, and can only be cleared by writing to L2ECTLR_EL1.
You shouldn't allow Linux to access this register as it could mess with the L2
configuration, which could also affect your EL3 and any secure-world software.

The arrival of this interrupt doesn't tell you which L2 tripped the error, and you can only
clear it if you write to L2ECTLR_EL1 on a CPU attached to the right L2. So this isn't
actually a shared (peripheral) interrupt.

This stuff is expected to be used by firmware, which can know the affinity constraints of
signals coming in as interrupts.


> Does 'asynchronous errors' in L2 used to report UE?

From "7.2.4 Error correction code", single-bit errors are always corrected.
A.8.2 quoted above gives the behaviour for double-bit errors.


> In case no interrupt, can we use die-notifier subsystem to check if any error had occur
> while system shutdown?

notify_die() would imply a synchronous exception that killed a thread. SErrors are a whole
lot worse.
Before v8.2 these are all treated as 'uncontained': unknown memory corruption.
Which in your L2 case is exactly what happened. The arch code will panic().

If your driver can print something useful to help debug the panic(), then a
panic_notifier sounds appropriate. But you can't rely on these notifiers being called, as
kdump has some hooks that affect if/when they run.

(KVM will 'contain' SError that come from a guest to the guest, as we know a distinct set
of memory was in use. You may see fatal error counters increasing without the system
panic()ing)

contained/uncontained is part of the terminology from the v8.2 RAS spec [1].


Thanks,

James

[0] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0488c/DDI0488C_cortex_a57_mpcore_r1p0_trm.pdf
[1] https://static.docs.arm.com/ddi0587/ca/ARM_DDI_0587C_a_RAS.pdf?_ga=2.148234679.1686960568.1560964184-897392434.1556719556
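
P.S. A minimal sketch of the L2ECTLR[30] check-and-clear step described above, as firmware
might do it after reading L2ECTLR_EL1 on a CPU attached to the affected L2. The macro and
function names are hypothetical, and the assumption that writing bit 30 back as zero is
what clears the signal should be checked against the TRM [0]. The helpers only operate on
a plain value, so the actual register read and write are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* L2ECTLR[30]: set when an L2 error is signalled. The macro name is hypothetical. */
#define L2ECTLR_L2_ASYNC_ERR	(UINT64_C(1) << 30)

/* Returns true if a value read from L2ECTLR_EL1 has the error bit set. */
static inline bool l2ectlr_error_pending(uint64_t l2ectlr)
{
	return (l2ectlr & L2ECTLR_L2_ASYNC_ERR) != 0;
}

/*
 * Returns the value to write back to L2ECTLR_EL1 to drop the level-sensitive
 * signal. Assumption: writing bit 30 as zero clears it. All other bits are
 * preserved so the L2 configuration is left alone.
 */
static inline uint64_t l2ectlr_clear_error(uint64_t l2ectlr)
{
	return l2ectlr & ~L2ECTLR_L2_ASYNC_ERR;
}
```

Because the signal is level sensitive, a handler would need to do the write-back before
returning, otherwise the interrupt would fire again immediately.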