Hi James, first of all thanks a lot for the constructive and fast feedback! > -----Original Message----- > From: James Morse <james.morse@xxxxxxx> > Sent: Tuesday, January 08, 2019 6:57 PM > > Hi Boris, Wladislav, > > On 08/01/2019 10:42, Borislav Petkov wrote: > > + James and leaving in the rest for reference. > > (thanks!) > > > So the first thing to figure out here is how generic is this and if > > so, to make it a cortex_a15_edac.c driver which contains all the RAS > > functionality for A15. Definitely not an EDAC driver per functional > > unit but rather per vendor or even ARM core. > > This is implementation-defined/specific-to-A15 and is documented in the > TRM [0]. > (On the 'all the RAS functionality for A15' front: there are two more registers: > L2MERRSR and CPUMERRSR. These are both accessible from the normal- > world, and don't appear to need enabling.) > > > But we have the usual pre-v8.2 problems, and in addition cluster-interrupts, > as this signal might be per-cluster, or it might be combined. > > Wladislav, I'm afraid we've had a few attempts at pre-8.2 EDAC drivers, the > below list of problems is what we've learnt along the way. The upshot is that > before the architected RAS extensions, the expectation is firmware will > handle all this, as its difficult for the OS to deal with. > > > My first question is how useful is a 'something bad happened' edac event? We experienced sometimes random user-space crashes where we didn't expect a bug in the application code. If there would be a notification by such edac event, we would at least know that something bad happened before. > Before the v8.2 extensions with its classification of errors, we don't know > anything more. > > The usual suspects are, (partly taken from the thread at [1]): > * A15 exists in big/little configurations. We need to know which CPUs are > A15. > * We need to know we aren't running under a hypervisor, (a hypervisor can > trap > accesses to these imp-def register, KVM does). > * Nothing else should be clearing these bits, e.g. secure-world software, or > another CPU. > * Secure-world needs to enable write-access to L2ECTLR, and we need to > know its done it. This needs doing on every CPU, and needs to not 'go > missing' > over cpu-hotplug or cpu-idle. > > These are things that don't naturally live in the DT. > > > The new-one is these cluster-interrupts: How do we know which set of CPUs > each interrupt goes with? What happens if user-space tries to rebalance > them? Valid question - so far, I didn't consider this case. > Another SoC with A15 may combine all the cluster-interrupts into a single > 'something bad happened' interrupt. Done like this, we would need to cross- > call to the other CPUs when we take an interrupt - which is not something we > can do. > > Is this a level or edge interrupt? Is it necessary to clear that bit in the register > to lower the interrupt line? > The TRM talks about 'pending L2 internal asynchronous error', pending > makes me suspect this is at least possible. If it is, a level-interrupt to one > CPU, that can only be cleared by another leads to deadlock. > > > Thanks, > > James > > > On Tue, Jan 08, 2019 at 08:10:45AM +0000, Wiebe, Wladislav (Nokia - > DE/Ulm) wrote: > >> This driver adds support for L2 internal asynchronous error detection > >> caused by L2 RAM double-bit ECC error or illegal writes to the > >> Interrupt Controller memory-map region on the Cortex A15. > > >> diff --git a/drivers/edac/cortex_a15_l2_async_edac.c > >> b/drivers/edac/cortex_a15_l2_async_edac.c > >> new file mode 100644 > >> index 000000000000..26252568e961 > >> --- /dev/null > >> +++ b/drivers/edac/cortex_a15_l2_async_edac.c > >> @@ -0,0 +1,134 @@ > >> +// SPDX-License-Identifier: GPL-2.0 > >> +/* > >> + * Copyright (C) 2018 Nokia Corporation > > (boiler plate not needed with the SPDX header) > > >> + */ > >> + > >> +#include <linux/module.h> > >> +#include <linux/interrupt.h> > >> +#include <linux/platform_device.h> > >> +#include <linux/of.h> > >> + > >> +#include "edac_module.h" > >> + > >> +#define DRIVER_NAME "cortex_a15_l2_async_edac" > >> + > >> +#define L2ECTLR_L2_ASYNC_ERR BIT(30) > >> + > >> +static irqreturn_t cortex_a15_l2_async_edac_err_handler(int irq, > >> +void *dev_id) { > >> + struct edac_device_ctl_info *dci = dev_id; > >> + u32 status = 0; > >> + > >> + /* > >> + * Read and clear L2ECTLR L2 ASYNC error bit caused by INTERRIRQ. > >> + * Reason could be a L2 RAM double-bit ECC error or illegal writes > >> + * to the Interrupt Controller memory-map region. > >> + */ > >> + asm("mrc p15, 1, %0, c9, c0, 3" : "=r" (status)); > > "L2 internal asynchronous error caused by L2 RAM double-bit ECC error" > doesn't tell us if a CPU consumed the error, or if the error has caused a write > to go missing. Without the classification, this means 'something bad > happened'. > > I'd prefer to panic() when we see one of these. I'd like it even more if > firmware rebooted for us. The EDAC subsystem allows to configure a panic() from userspace/sysfs. So we can be flexible at this point I think. > > >> + if (status & L2ECTLR_L2_ASYNC_ERR) { > >> + status &= ~L2ECTLR_L2_ASYNC_ERR; > >> + asm("mcr p15, 1, %0, c9, c0, 3" : : "r" (status)); > > 4.3.49 "L2 Extended Control Register" of the A15 TRM says this field can be > read-only/write-ignored for the normal world if NSACR.NS_L2ERR is 0. > > How do we know if firmware has set this bit on all CPUs? We can't clear the > error otherwise. Valid point! > > >> + edac_printk(KERN_EMERG, DRIVER_NAME, > >> + "L2 internal asynchronous error occurred!\n"); > >> + edac_device_handle_ue(dci, 0, 0, dci->ctl_name); > > >> + > >> + return IRQ_HANDLED; > >> + } > >> + > >> + return IRQ_NONE; > >> +} > >> + > >> +static int cortex_a15_l2_async_edac_probe(struct platform_device > >> +*pdev) { > >> + struct edac_device_ctl_info *dci; > >> + struct device_node *np = pdev->dev.of_node; > >> + char *ctl_name = (char *)np->name; > >> + int i = 0, ret = 0, err_irq = 0, irq_count = 0; > >> + > >> + /* We can have multiple CPU clusters with one INTERRIRQ per cluster > >> +*/ > > Surely this an integration choice? > > You're accessing the cluster through a cpu register in the handler, what > happens if the interrupt is delivered to the wrong cluster? > How do we know which interrupt maps to which cluster? > How do we stop user-space 'balancing' the interrupts? You are right, based on all your inputs I think we can stop using this driver as generic A15 solution (at least I would need more time to do the refactoring considering all points you stated and experienced already). Thanks a lot! - Wladislav