RE: [PATCH v6 2/2] EDAC/zynqmp: Add EDAC support for Xilinx ZynqMP OCM

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Boris,

> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Friday, November 25, 2022 8:42 PM
> To: Potthuri, Sai Krishna <sai.krishna.potthuri@xxxxxxx>
> Cc: Rob Herring <robh+dt@xxxxxxxxxx>; Krzysztof Kozlowski
> <krzysztof.kozlowski+dt@xxxxxxxxxx>; Michal Simek
> <michal.simek@xxxxxxxxxx>; Mauro Carvalho Chehab
> <mchehab@xxxxxxxxxx>; Tony Luck <tony.luck@xxxxxxxxx>; James Morse
> <james.morse@xxxxxxx>; Robert Richter <rric@xxxxxxxxxx>;
> devicetree@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; linux-edac@xxxxxxxxxxxxxxx;
> saikrishna12468@xxxxxxxxx; git (AMD-Xilinx) <git@xxxxxxx>; Datta,
> Shubhrajyoti <shubhrajyoti.datta@xxxxxxx>; kernel test robot
> <lkp@xxxxxxxxx>
> Subject: Re: [PATCH v6 2/2] EDAC/zynqmp: Add EDAC support for Xilinx
> ZynqMP OCM
> 
> On Wed, Nov 02, 2022 at 12:36:55PM +0530, Sai Krishna Potthuri wrote:
> > Add EDAC support for Xilinx ZynqMP OCM Controller,
> 
> So this:
> 
> > this driver reports CE and UE errors upon interrupt generation, and
> > also creates UE/CE debugfs entries for error injection when EDAC_DEBUG
> > config is enabled.
> 
> I can see in the patch itself.
> 
> Do not talk about what your patch does - that should hopefully be visible
> from the diff. Rather, talk about *why* you're doing what you're doing.
> 
> Like, for example, you can explain here how this driver is going to co-exist
> with the other EDAC driver, i.e., the Synopsys one or the DDRMC.
Ok, will update.

> 
> > Co-developed-by: Shubhrajyoti Datta <shubhrajyoti.datta@xxxxxxx>
> > Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xxxxxxx>
> > Signed-off-by: Sai Krishna Potthuri <sai.krishna.potthuri@xxxxxxx>
> > Reported-by: kernel test robot <lkp@xxxxxxxxx>
> 
> What exactly was reported by the robot?
> 
> Pls put that in the commit message as
> 
> "Fix issue <BLA> as reported by the robot."
> 
> so that it is clear what that Reported-by: refers to.
Ok, will update.

> 
> > ---
> >  MAINTAINERS                    |   7 +
> >  drivers/edac/Kconfig           |   9 +
> >  drivers/edac/Makefile          |   1 +
> >  drivers/edac/zynqmp_ocm_edac.c | 485
> > +++++++++++++++++++++++++++++++++
> >  4 files changed, 502 insertions(+)
> >  create mode 100644 drivers/edac/zynqmp_ocm_edac.c
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS index
> > edc96cdb85e8..7a40caf536c2 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -21692,6 +21692,13 @@ F:
> 	Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-
> dpdma.yaml
> >  F:	drivers/dma/xilinx/xilinx_dpdma.c
> >  F:	include/dt-bindings/dma/xlnx-zynqmp-dpdma.h
> >
> > +XILINX ZYNQMP OCM EDAC DRIVER
> > +M:	Shubhrajyoti Datta <shubhrajyoti.datta@xxxxxxx>
> > +M:	Sai Krishna Potthuri <sai.krishna.potthuri@xxxxxxx>
> > +S:	Maintained
> > +F:	Documentation/devicetree/bindings/memory-
> controllers/xlnx,zynqmp-ocmc-1.0.yaml
> > +F:	drivers/edac/zynqmp_ocm_edac.c
> > +
> >  XILINX ZYNQMP PSGTR PHY DRIVER
> >  M:	Anurag Kumar Vulisha <anurag.kumar.vulisha@xxxxxxxxxx>
> >  M:	Laurent Pinchart <laurent.pinchart@xxxxxxxxxxxxxxxx>
> > diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index
> > 58ab63642e72..bc30b7d02062 100644
> > --- a/drivers/edac/Kconfig
> > +++ b/drivers/edac/Kconfig
> > @@ -539,4 +539,13 @@ config EDAC_DMC520
> >  	  Support for error detection and correction on the
> >  	  SoCs with ARM DMC-520 DRAM controller.
> >
> > +config EDAC_ZYNQMP_OCM
> > +	tristate "Xilinx ZynqMP OCM Controller"
> > +	depends on ARCH_ZYNQMP || COMPILE_TEST
> > +	help
> > +	  This driver is targeted for Xilinx ZynqMP OCM (On Chip Memory)
> 
> "This driver supports ...."
Ok, will update.

> 
> > +	  controller to support for error detection and correction.
> > +	  This driver can also be built as a module. If so, the module
> > +	  will be called zynqmp_ocm_edac.
> > +
> >  endif # EDAC
> > diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index
> > 2d1641a27a28..634c1cec1588 100644
> > --- a/drivers/edac/Makefile
> > +++ b/drivers/edac/Makefile
> > @@ -84,3 +84,4 @@ obj-$(CONFIG_EDAC_QCOM)			+=
> qcom_edac.o
> >  obj-$(CONFIG_EDAC_ASPEED)		+= aspeed_edac.o
> >  obj-$(CONFIG_EDAC_BLUEFIELD)		+= bluefield_edac.o
> >  obj-$(CONFIG_EDAC_DMC520)		+= dmc520_edac.o
> > +obj-$(CONFIG_EDAC_ZYNQMP_OCM)		+=
> zynqmp_ocm_edac.o
> 
> Is there going to be another ZynqMP EDAC driver?
> 
> If not, this one could be simply called zynqmp_edac.c or?
As we communicated earlier for ZynqMP platform we have both Synopsys (for DDRMC)
and zynqmp_ocm_edac (for OCM) drivers.
Just to be clear about what this driver is targeted for, we used ocm as part of file name.
Ok, zynqmp_edac.c looks fine, will update.

> 
> > diff --git a/drivers/edac/zynqmp_ocm_edac.c
> > b/drivers/edac/zynqmp_ocm_edac.c new file mode 100644 index
> > 000000000000..32f025b72380
> > --- /dev/null
> > +++ b/drivers/edac/zynqmp_ocm_edac.c
> 
> > +/* Interrupt masks */
> > +#define OCM_CEINTR_MASK		BIT(6)
> > +#define OCM_UEINTR_MASK		BIT(7)
> > +#define OCM_ECC_ENABLE_MASK	BIT(0)
> > +#define OCM_CEUE_MASK		GENMASK(7, 6)
> 
> Drop that one and use
> 
> 	OCM_CEINTR_MASK | OCM_UEINTR_MASK
> 
> everywhere.
Ok, will update.

> 
> > +#define OCM_FICOUNT_MASK	GENMASK(23, 0)
> > +#define OCM_NUM_UE_BITPOS	2
> > +#define OCM_BASEVAL		0xFFFC0000
> > +#define EDAC_DEVICE		"ZynqMP-OCM"
> > +
> > +/**
> > + * struct ecc_error_info - ECC error log information
> > + * @addr:	Fault generated at this address
> > + * @data0:	Generated fault data (lower 32-bit)
> > + * @data1:	Generated fault data (upper 32-bit)
> > + */
> > +struct ecc_error_info {
> > +	u32 addr;
> > +	u32 data0;
> > +	u32 data1;
> 
> What's wrong with calling those fault_lo and fault_hi then?
Ok, will update.

> 
> > +/**
> > + * get_error_info - Get the current ECC error info
> > + * @base:	Pointer to the base address of the OCM memory controller
> > + * @p:		Pointer to the OCM ECC status structure
> > + * @mask:	Status register mask value
> > + *
> > + * Determines there is any ECC error or not
> > + *
> > + */
> > +static void get_error_info(void __iomem *base, struct ecc_status *p,
> > +int mask) {
> > +	if (mask & OCM_CEINTR_MASK) {
> > +		p->ce_cnt++;
> > +		p->ceinfo.data0 = readl(base + CE_FFD0_OFST);
> > +		p->ceinfo.data1 = readl(base + CE_FFD1_OFST);
> > +		p->ceinfo.addr = (OCM_BASEVAL | readl(base +
> CE_FFA_OFST));
> > +		writel(ECC_CTRL_CLR_CE_ERR, base + OCM_ISR_OFST);
> > +	} else {
> 
> I guess you need to check OCM_UEINTR_MASK here?
Since we are dealing other interrupts in intr_handler(), this can be simple else.

> 
> > +		p->ue_cnt++;
> > +		p->ueinfo.data0 = readl(base + UE_FFD0_OFST);
> > +		p->ueinfo.data1 = readl(base + UE_FFD1_OFST);
> > +		p->ueinfo.addr = (OCM_BASEVAL | readl(base +
> UE_FFA_OFST));
> > +		writel(ECC_CTRL_CLR_UE_ERR, base + OCM_ISR_OFST);
> > +	}
> 
> No, I didn't mean for you to drop that block. See comment in
> intr_handler() below.
Ok, will handle the warning in intr_handler() if it raises for other interrupts.

> 
> > +/**
> > + * handle_error - Handle controller error types CE and UE
> > + * @dci:	Pointer to the EDAC device controller instance
> > + * @p:		Pointer to the OCM ECC status structure
> > + *
> > + * Handles the controller ECC correctable and uncorrectable error.
> 
> s/controller// - we know it is the controller. You probably should go through
> all text in here and tone down all the "controller" mentions.
Ok, will update.

> 
> > + */
> > +static void handle_error(struct edac_device_ctl_info *dci, struct
> > +ecc_status *p) {
> > +	struct edac_priv *priv = dci->pvt_info;
> > +	struct ecc_error_info *pinf;
> > +
> > +	if (p->ce_cnt) {
> > +		pinf = &p->ceinfo;
> > +		snprintf(priv->message, ZYNQMP_OCM_EDAC_MSG_SIZE,
> > +			 "\nOCM ECC error type :%s\nAddr: [0x%x]\nFault
> Data[0x%08x%08x]",
> > +			 "CE", pinf->addr, pinf->data1, pinf->data0);
> > +		edac_device_handle_ce(dci, 0, 0, priv->message);
> > +	}
> > +
> > +	if (p->ue_cnt) {
> > +		pinf = &p->ueinfo;
> > +		snprintf(priv->message, ZYNQMP_OCM_EDAC_MSG_SIZE,
> > +			 "\nOCM ECC error type :%s\nAddr: [0x%x]\nFault
> Data[0x%08x%08x]",
> > +			 "UE", pinf->addr, pinf->data1, pinf->data0);
> > +		edac_device_handle_ue(dci, 0, 0, priv->message);
> > +	}
> > +
> > +	memset(p, 0, sizeof(*p));
> > +}
> > +
> > +/**
> > + * intr_handler - ISR routine
> > + * @irq:        irq number
> > + * @dev_id:     device id pointer
> > + *
> > + * Return: IRQ_NONE, if interrupt not set or IRQ_HANDLED otherwise
> > +*/ static irqreturn_t intr_handler(int irq, void *dev_id) {
> > +	struct edac_device_ctl_info *dci = dev_id;
> > +	struct edac_priv *priv = dci->pvt_info;
> > +	int regval;
> > +
> > +	regval = readl(priv->baseaddr + OCM_ISR_OFST);
> > +	if (!(regval & OCM_CEUE_MASK))
> > +		return IRQ_NONE;
> 
> What is that check for?
> 
> If neither of the masks are set but the device still raises an error interrupt,
> then you need to WARN_ONCE() here so that people look at this and debug it
> as to why it still raised an interrupt.
Ok, will update.

> 
> > +	get_error_info(priv->baseaddr, &priv->stat, regval);
> > +
> > +	priv->ce_cnt += priv->stat.ce_cnt;
> > +	priv->ue_cnt += priv->stat.ue_cnt;
> > +	handle_error(dci, &priv->stat);
> > +
> > +	return IRQ_HANDLED;
> > +}
> > +
> > +/**
> > + * get_eccstate - Return the controller ECC status
> > + * @base:	Pointer to the OCM memory controller base address
> > + *
> > + * Get the ECC enable/disable status for the controller
> > + *
> > + * Return: ECC status 0/1.
> > + */
> > +static bool get_eccstate(void __iomem *base) {
> > +	return readl(base + ECC_CTRL_OFST) & OCM_ECC_ENABLE_MASK; }
> > +
> > +#ifdef CONFIG_EDAC_DEBUG
> > +/**
> > + * inject_fault_count - write fault injection count
> > + * @priv:	Pointer to the EDAC private struct
> > + *
> > + * Update the fault injection count register, once the counter
> > +reaches
> > + * zero, it injects errors
> > + */
> > +static void inject_fault_count(struct edac_priv *priv) {
> > +	u32 ficount = priv->fault_injection_cnt;
> > +
> > +	ficount &= OCM_FICOUNT_MASK;
> > +	if (ficount != priv->fault_injection_cnt)
> 
> Do this:
> 
> 	if (ficount & ~OCM_FICOUNT_MASK) {
> 		ficount &= OCM_FICOUNT_MASK;
> 		edac_printk(KERN_INFO, EDAC_DEVICE, "Fault injection
> count value truncated to: %d\n", ficount);
> 	}
> 
> i.e., mask it *only* when it is larger.
Ok, will update.

> 
> > +
> > +	writel(ficount, priv->baseaddr + OCM_FIC_OFST); }
> > +
> > +/**
> > + * inject_cebitpos - Set CE bit position
> > + * @priv:	Pointer to the EDAC private struct
> > + *
> > + * Set any one bit to inject CE error  */ static void
> > +inject_cebitpos(struct edac_priv *priv) {
> > +	if (priv->ce_bitpos <= UE_MAX_BITPOS_LOWER) {
> > +		writel(BIT(priv->ce_bitpos), priv->baseaddr +
> OCM_FID0_OFST);
> > +		writel(0, priv->baseaddr + OCM_FID1_OFST);
> > +	} else {
> > +		writel(BIT(priv->ce_bitpos - UE_MIN_BITPOS_UPPER),
> > +		       priv->baseaddr + OCM_FID1_OFST);
> > +		writel(0, priv->baseaddr + OCM_FID0_OFST);
> 
> I had to stare at this a bit to figure out what you're doing: the injection
> registers are two 32-bit quantities and depending on where you inject, you
> need to select into which offset to write it.
> 
> But looking more at this, all the proper range checks should happen in the
> debugfs callbacks, i.e., inject_ce_write() in this case.
> 
> The actual injection function should only inject - that's it.
> 
> And come to think of it, you don't need inject_cebitpos() or inject_uebitpos().
> 
> Your debugfs callbacks should:
> 
> 1. check the range, error out and print a warning if range wrong 2. inject
> otherwise.
> 
> and that's it.
Ok, will re-arrange the logic.

> 
> ...
> 
> > +static ssize_t inject_ue_write(struct file *file, const char __user *data,
> > +			       size_t count, loff_t *ppos) {
> > +	struct edac_device_ctl_info *edac_dev = file->private_data;
> > +	struct edac_priv *priv = edac_dev->pvt_info;
> > +	char buf[6];
> > +	u8 pos0, pos1, len;
> > +	int ret;
> 
> The EDAC tree preferred ordering of variable declarations at the beginning of
> a function is reverse fir tree order::
> 
> 	struct long_struct_name *descriptive_name;
> 	unsigned long foo, bar;
> 	unsigned int tmp;
> 	int ret;
> 
> The above is faster to parse than the reverse ordering::
> 
> 	int ret;
> 	unsigned int tmp;
> 	unsigned long foo, bar;
> 	struct long_struct_name *descriptive_name;
> 
> And even more so than random ordering::
> 
> 	unsigned long foo, bar;
> 	int ret;
> 	struct long_struct_name *descriptive_name;
> 	unsigned int tmp;
> 
> You're pretty much doing it but some functions' local vars still need re-
> sorting.
Ok, will update.

> 
> > +
> > +	if (!data)
> > +		return -EFAULT;
> > +
> > +	len = min_t(size_t, count, sizeof(buf));
> > +	if (copy_from_user(buf, data, len))
> > +		return -EFAULT;
> > +
> > +	for (pos0 = 0; buf[pos0] != ' ' && pos0 <= len; pos0++)
> > +		;
> > +
> > +	if (pos0 > len)
> > +		return -EINVAL;
> > +
> > +	ret = kstrtou8_from_user(data, pos0, 0, &priv->ue_bitpos[0]);
> > +	if (ret)
> > +		return ret;
> > +
> > +	for (pos1 = pos0 + 1; buf[pos1] != '\n' && pos1 <= len; pos1++)
> > +		;
> > +
> > +	if (pos1 > count)
> > +		return -EINVAL;
> > +
> > +	ret = kstrtou8_from_user(&data[pos0 + 1], pos1 - pos0 - 1, 0,
> > +				 &priv->ue_bitpos[1]);
> 
> This looks like it is composing multiple values. I guess the two bits for an UE
> since UE is a two-bit error.
> 
> No documentation?
> 
> IOW, you need to document how this injection works. In a comment here
> somewhere, pls explain what the user is supposed to do when she wants to
> inject errors.
> 
> Example:
> 
> Documentation/firmware-guide/acpi/apei/einj.rst
> 
> You don't have to do a separate file and go too much into detail but at least a
> simple injection recipe/example would be prudent to have.
Ok, will update API documentation like below.
echo <fault_injection count> > /sys/kernel/debug/edac/ff960000.memory-controller/inject_fault_count
echo <bit pos0> <bit pos1> > /sys/kernel/debug/edac/ff960000.memory-controller/inject_ue_bitpos

> 
> > +static int edac_probe(struct platform_device *pdev) {
> > +	struct edac_priv *priv;
> > +	struct edac_device_ctl_info *dci;
> > +	void __iomem *baseaddr;
> > +	struct resource *res;
> > +	int irq, ret;
> > +
> > +	baseaddr = devm_platform_get_and_ioremap_resource(pdev, 0,
> &res);
> > +	if (IS_ERR(baseaddr))
> > +		return PTR_ERR(baseaddr);
> > +
> > +	if (!get_eccstate(baseaddr)) {
> > +		edac_printk(KERN_INFO, EDAC_DEVICE,
> > +			    "ECC not enabled\n");
> 
> No need to break that line.
Ok, will update.

> 
> > +		return -ENXIO;
> > +	}
> > +
> > +	dci = edac_device_alloc_ctl_info(sizeof(*priv),
> ZYNQMP_OCM_EDAC_STRING,
> > +					 1, ZYNQMP_OCM_EDAC_STRING, 1,
> 0, NULL, 0,
> > +					 edac_device_alloc_index());
> > +	if (!dci)
> > +		return -ENOMEM;
> > +
> > +	priv = dci->pvt_info;
> > +	platform_set_drvdata(pdev, dci);
> > +	dci->dev = &pdev->dev;
> > +	priv->baseaddr = baseaddr;
> > +	dci->mod_name = pdev->dev.driver->name;
> > +	dci->ctl_name = ZYNQMP_OCM_EDAC_STRING;
> > +	dci->dev_name = dev_name(&pdev->dev);
> > +
> > +	irq = platform_get_irq(pdev, 0);
> > +	if (irq < 0) {
> > +		ret = irq;
> > +		goto free_dev_ctl;
> > +	}
> > +
> > +	ret = devm_request_irq(&pdev->dev, irq, intr_handler, 0,
> > +			       dev_name(&pdev->dev), dci);
> > +	if (ret) {
> > +		edac_printk(KERN_ERR, EDAC_DEVICE, "Failed to request
> Irq\n");
> > +		goto free_dev_ctl;
> > +	}
> > +
> > +	/* Enable UE, CE interrupts */
> > +	writel(OCM_CEUE_MASK, priv->baseaddr + OCM_IEN_OFST);
> > +
> > +#ifdef CONFIG_EDAC_DEBUG
> > +	setup_debugfs(dci);
> > +#endif
> 
> Do this instead:
> 
> diff --git a/drivers/edac/zynqmp_ocm_edac.c
> b/drivers/edac/zynqmp_ocm_edac.c index 32f025b72380..a2b8cf1eb986
> 100644
> --- a/drivers/edac/zynqmp_ocm_edac.c
> +++ b/drivers/edac/zynqmp_ocm_edac.c
> @@ -428,9 +428,8 @@ static int edac_probe(struct platform_device *pdev)
>  	/* Enable UE, CE interrupts */
>  	writel(OCM_CEUE_MASK, priv->baseaddr + OCM_IEN_OFST);
> 
> -#ifdef CONFIG_EDAC_DEBUG
> -	setup_debugfs(dci);
> -#endif
> +	if (IS_ENABLED(CONFIG_EDAC_DEBUG))
> +		setup_debugfs(dci);
> 
>  	ret = edac_device_add_device(dci);
>  	if (ret)
> 
> below too.
Ok, will update.

> 
> > +
> > +	ret = edac_device_add_device(dci);
> > +	if (ret)
> > +		goto free_dev_ctl;
> > +
> > +	return 0;
> > +
> > +free_dev_ctl:
> > +	edac_device_free_ctl_info(dci);
> > +
> > +	return ret;
> > +}
> > +
> > +static int edac_remove(struct platform_device *pdev) {
> > +	struct edac_device_ctl_info *dci = platform_get_drvdata(pdev);
> > +	struct edac_priv *priv = dci->pvt_info;
> > +
> > +	/* Disable UE, CE interrupts */
> > +	writel(OCM_CEUE_MASK, priv->baseaddr + OCM_IDS_OFST);
> > +
> > +#ifdef CONFIG_EDAC_DEBUG
> > +	debugfs_remove_recursive(priv->debugfs_dir);
> > +#endif
> > +
> > +	edac_device_del_device(&pdev->dev);
> > +	edac_device_free_ctl_info(dci);
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct of_device_id zynqmp_ocm_edac_match[] = {
> > +	{ .compatible = "xlnx,zynqmp-ocmc-1.0"},
> > +	{ /* end of table */ }
> > +};
> > +
> > +MODULE_DEVICE_TABLE(of, zynqmp_ocm_edac_match);
> > +
> > +static struct platform_driver zynqmp_ocm_edac_driver = {
> > +	.driver = {
> > +		   .name = "zynqmp-ocm-edac",
> > +		   .of_match_table = zynqmp_ocm_edac_match,
> > +		   },
> > +	.probe = edac_probe,
> > +	.remove = edac_remove,
> > +};
> > +
> > +module_platform_driver(zynqmp_ocm_edac_driver);
> > +
> > +MODULE_AUTHOR("Advanced Micro Devices, Inc");
> > +MODULE_DESCRIPTION("Xilinx ZynqMP OCM ECC driver");
> > +MODULE_LICENSE("GPL");
> > --
> > 2.25.1
> >
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

<<attachment: winmail.dat>>


[Index of Archives]     [Device Tree Compilter]     [Device Tree Spec]     [Linux Driver Backports]     [Video for Linux]     [Linux USB Devel]     [Linux PCI Devel]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Yosemite Backpacking]


  Powered by Linux