Hi Sascha,

On 13/10/2020 13:50, Sascha Hauer wrote:
> The Cortex A53 and A57 cores have error detection capabilities for the
> L1/L2 Caches, this patch adds a driver for them.
>
> Unfortunately there is no robust way to inject errors into the caches,
> so this driver doesn't contain any code to actually test it. It has
> been tested though with code taken from an older version of this driver
> found here: https://lkml.org/lkml/2018/3/14/1203. For reasons stated
> in this thread the error injection code is not suitable for mainline,
> so it is removed from the driver.

> diff --git a/drivers/edac/cortex_arm64_l1_l2.c b/drivers/edac/cortex_arm64_l1_l2.c
> new file mode 100644
> index 000000000000..fb8386eb40ac
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_l1_l2.c
> @@ -0,0 +1,208 @@

> +static void read_errors(void *data)
> +{
> +	struct edac_device_ctl_info *edac_ctl = data;
> +	int cpu = smp_processor_id();
> +	char msg[MESSAGE_SIZE];
> +	u64 cpumerr, l2merr;
> +
> +	/* cpumerrsr_el1 */
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (cpumerr));
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (0));

I think you've seen earlier comments on using the sys_reg macros for
this. There were versions of binutils out there that choke on this.

[...]

> +}
> +
> +static void cortex_arm64_edac_check(struct edac_device_ctl_info *edac_ctl)
> +{
> +	struct arm64_pvt *pvt = edac_ctl->pvt_info;
> +	call_single_data_t *csd;
> +	int cpu;
> +
> +	get_online_cpus();
> +	for_each_cpu_and(cpu, cpu_online_mask, &pvt->compat_mask) {
> +		csd = per_cpu_ptr(pvt->csd_check, cpu);
> +		csd->func = read_errors;
> +		csd->info = edac_ctl;
> +		csd->flags = 0;
> +		/* Read CPU L1/L2 errors */
> +		smp_call_function_single_async(cpu, csd);
> +		/* Wait until flags cleared */
> +		smp_cond_load_acquire(&csd->flags, !VAL);

Hmm. We end up waiting for each CPU to schedule something else. I can't
see any reason we can't sleep here.

Can't we use smp_call_function_many() here? It already considers
cpu_online_mask, you'd just need to deal with read_errors() being
called in parallel with itself.

(Concurrent calls into edac are one problem, but two CPUs reading and
writing the same L2 register could lead to double counting.)

> +	}
> +	put_online_cpus();
> +}

> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	struct device_node *np, *dn = pdev->dev.of_node;
> +	struct edac_device_ctl_info *edac_ctl;
> +	struct device *dev = &pdev->dev;
> +	struct of_phandle_iterator it;
> +	struct arm64_pvt *pvt;
> +	int rc, cpu;
> +
> +	edac_ctl = edac_device_alloc_ctl_info(sizeof(*pvt), "cpu_cache",
> +					      1, "L", 2, 1, NULL, 0,
> +					      edac_device_alloc_index());

I used this series on Juno to poke the user-space interface: this
chokes on a big/little system as it can't register "cpu_cache" a
second time.

I think we should try to make the topology look like the one in
edac_device.h. This means calling it 'cpu' and registering all of them
up front. On a big/little system the second probe() call would need to
be careful. I can have a go at this if you don't have a platform to
hand.

(The 'L2-cache' thing in edac_device.h turns out to be impossible, and
the 'Lx' you've done here is the most popular option. I'll post a patch
to change the documentation to match what people are doing.)

[...]

> +}

Thanks,

James
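
P.S. Some rough, completely untested sketches of what I meant above, in
case they help.

For the sysreg access, something like this (the SYS_* names below are
made up, they aren't in sysreg.h):

/* IMP DEF Cortex-A53/A57 error record registers, names invented here */
#define SYS_CPUMERRSR_EL1	sys_reg(3, 1, 15, 2, 2)
#define SYS_L2MERRSR_EL1	sys_reg(3, 1, 15, 2, 3)

	/* read and clear the L1 error record */
	cpumerr = read_sysreg_s(SYS_CPUMERRSR_EL1);
	write_sysreg_s(0, SYS_CPUMERRSR_EL1);

read_sysreg_s()/write_sysreg_s() emit the encoding themselves, which
should keep old binutils that don't understand s3_1_c15_c2_2 out of
the picture.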
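
For the check function, with smp_call_function_many() it could look
roughly like the below. read_errors() then runs on all the CPUs at the
same time, so the shared L2 register and the edac reporting would need
a lock (that's the double-counting problem above):

static void cortex_arm64_edac_check(struct edac_device_ctl_info *edac_ctl)
{
	struct arm64_pvt *pvt = edac_ctl->pvt_info;

	get_online_cpus();
	preempt_disable();

	/* run read_errors() on every online CPU in the mask but this one */
	smp_call_function_many(&pvt->compat_mask, read_errors, edac_ctl, true);

	/* smp_call_function_many() skips the local CPU */
	if (cpumask_test_cpu(smp_processor_id(), &pvt->compat_mask))
		read_errors(edac_ctl);

	preempt_enable();
	put_online_cpus();
}

(or on_each_cpu_mask(), which does the local-CPU dance for you)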
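
For the sysfs layout, registering everything up front would be roughly
the below, one instance per possible CPU just to show the shape; the
second probe() on a big/little system would still need handling:

	edac_ctl = edac_device_alloc_ctl_info(sizeof(*pvt), "cpu",
					      num_possible_cpus(), "L", 2, 1,
					      NULL, 0,
					      edac_device_alloc_index());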