Hi Wladislav, On 09/01/2019 14:44, Wiebe, Wladislav (Nokia - DE/Ulm) wrote: >> From: James Morse <james.morse@xxxxxxx> >> Sent: Tuesday, January 08, 2019 6:57 PM >> On 08/01/2019 10:42, Borislav Petkov wrote: >>> So the first thing to figure out here is how generic is this and if >>> so, to make it a cortex_a15_edac.c driver which contains all the RAS >>> functionality for A15. Definitely not an EDAC driver per functional >>> unit but rather per vendor or even ARM core. >> >> This is implementation-defined/specific-to-A15 and is documented in the >> TRM [0]. >> (On the 'all the RAS functionality for A15' front: there are two more registers: >> L2MERRSR and CPUMERRSR. These are both accessible from the normal- >> world, and don't appear to need enabling.) After I sent this it occurred to me the core can't know about errors in the L3 cache (if there is one) or the memory-controller. These may have edac/ras abilities, but they are selected by the soc integrator, so could be per soc. This goes against Boris's no-per-functional-unit edac drivers. If we had to pick one out of that set, I think the memory-controller is most useful as DRAM is the most likely to be affected by errors. >> But we have the usual pre-v8.2 problems, and in addition cluster-interrupts, >> as this signal might be per-cluster, or it might be combined. >> >> Wladislav, I'm afraid we've had a few attempts at pre-8.2 EDAC drivers, the >> below list of problems is what we've learnt along the way. The upshot is that >> before the architected RAS extensions, the expectation is firmware will >> handle all this, as its difficult for the OS to deal with. >> >> >> My first question is how useful is a 'something bad happened' edac event? > > We experienced sometimes random user-space crashes where we didn't > expect a bug in the application code. If there would be a notification > by such edac event, Sure, but we always have to assume its the worst case: an uncontained error (to use the v8.2 terms). A write has gone somewhere it shouldn't, we can't trust memory anymore. > we would at least know that something bad happened before. >>> On Tue, Jan 08, 2019 at 08:10:45AM +0000, Wiebe, Wladislav (Nokia - >> DE/Ulm) wrote: >>>> This driver adds support for L2 internal asynchronous error detection >>>> caused by L2 RAM double-bit ECC error or illegal writes to the >>>> Interrupt Controller memory-map region on the Cortex A15. >> >>>> diff --git a/drivers/edac/cortex_a15_l2_async_edac.c >>>> b/drivers/edac/cortex_a15_l2_async_edac.c >>>> new file mode 100644 >>>> index 000000000000..26252568e961 >>>> --- /dev/null >>>> +++ b/drivers/edac/cortex_a15_l2_async_edac.c >>>> @@ -0,0 +1,134 @@ >>>> +static int cortex_a15_l2_async_edac_probe(struct platform_device >>>> +*pdev) { >>>> + struct edac_device_ctl_info *dci; >>>> + struct device_node *np = pdev->dev.of_node; >>>> + char *ctl_name = (char *)np->name; >>>> + int i = 0, ret = 0, err_irq = 0, irq_count = 0; >>>> + >>>> + /* We can have multiple CPU clusters with one INTERRIRQ per cluster >>>> +*/ >> >> Surely this an integration choice? >> >> You're accessing the cluster through a cpu register in the handler, what >> happens if the interrupt is delivered to the wrong cluster? >> How do we know which interrupt maps to which cluster? >> How do we stop user-space 'balancing' the interrupts? > > You are right, based on all your inputs I think we can stop using this driver > as generic A15 solution Handling this interrupt in firmware is probably the best for your soc. For a generic a15 driver in the kernel, we would have to consider 'no interrupt', (e.g. the interrupt is wired to some other SCP/BMC thing). Once we've got polling code for these registers, we may as well always use it. Thanks, James