Re: [PATCH v5 5/5] EDAC: Versal NET: Add support for error notification

Borislav Petkov <bp@xxxxxxxxx> · Tue, 11 Feb 2025 10:40:02 +0100

On Mon, Jan 06, 2025 at 11:03:58AM +0530, Shubhrajyoti Datta wrote:
> Subject: Re: [PATCH v5 5/5] EDAC: Versal NET: Add support for error notification

Do a

  git log drivers/edac/

to get an idea how the subject format for that subsystem should be.

> The Versal NET edac listens to the notifications from NMC(Network

s/edac/EDAC/g

Check all your patches.

> management controller) on rpmsg.

I'm lucky Michal explains to me on internal chat what this all means.

Please write proper english and explain what "rpmsg" is.

> The driver registers on the error_edac channel.

What is "the error_edac channel"?

> Send a RAS event trace upon a notification. On reading the notification the
> user space application can decide on the response.  A sysfs reset entry is
> created for the same that sends an acknowledgment back to the NMC.

It says below "- remove reset". Please rewrite this commit message properly
- it is not write-only. Use LLM AI for the formulations.

> For reporting events register to the RAS framework. For memory mc events are
> used other events use non-standard events.

This basically explains what the patch does - I'd like to read why this
patch exists.

> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xxxxxxx>
> ---
> 
> Changes in v5:
> Update the compatible
> Update the handle_error documentation
> 
> Changes in v4:
> Update the compatible
> 
> Changes in v3:
> make remove void
> 
> Changes in v2:
> - remove reset
> - Add the remote proc requests
> - remove probe_once
> - reorder the rpmsg registration
> - the data width , rank and number of channel is read from message.
> 
>  drivers/edac/Kconfig                |    9 +
>  drivers/edac/Makefile               |    1 +
>  drivers/edac/versalnet_rpmsg_edac.c | 1325 +++++++++++++++++++++++++++
>  3 files changed, 1335 insertions(+)
>  create mode 100644 drivers/edac/versalnet_rpmsg_edac.c

Why is this thing called "versalnet_rpmsg_edac.c"? Why not simply "versalnet_edac.c"?

> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 06f7b43a6f78..4241936a8e7a 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -546,5 +546,14 @@ config EDAC_VERSAL
>  	  Support injecting both correctable and uncorrectable errors
>  	  for debugging purposes.
>  
> +config EDAC_VERSALNET
> +	tristate "AMD Versal NET EDAC"
> +	depends on CDX_CONTROLLER && ARCH_ZYNQMP
> +	help
> +	  Support for error detection and correction on the AMD Versal NET DDR
> +	  memory controller.
> +
> +	  The memory controller supports single bit error correction, double bit
> +	  error detection.

> Report various errors to the userspace.

That is superfluous - this text should only talk about your hw - not what
Linux does.

>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index f9cf19d8d13d..a8328522f9c3 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -86,3 +86,4 @@ obj-$(CONFIG_EDAC_DMC520)		+= dmc520_edac.o
>  obj-$(CONFIG_EDAC_NPCM)			+= npcm_edac.o
>  obj-$(CONFIG_EDAC_ZYNQMP)		+= zynqmp_edac.o
>  obj-$(CONFIG_EDAC_VERSAL)		+= versal_edac.o
> +obj-$(CONFIG_EDAC_VERSALNET)		+= versalnet_rpmsg_edac.o
> diff --git a/drivers/edac/versalnet_rpmsg_edac.c b/drivers/edac/versalnet_rpmsg_edac.c
> new file mode 100644
> index 000000000000..a54911f07c67
> --- /dev/null
> +++ b/drivers/edac/versalnet_rpmsg_edac.c
> @@ -0,0 +1,1325 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * AMD Versal NET memory controller driver
> + * Copyright (C) 2024 Advanced Micro Devices, Inc.
> + */
> +
> +#include <linux/edac.h>
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/ras.h>
> +#include <linux/rpmsg.h>
> +#include <linux/sizes.h>
> +#include <ras/ras_event.h>
> +#include <linux/remoteproc.h>
> +
> +#include "edac_module.h"
> +#include "../cdx/cdx.h"
> +#include "../cdx/controller/mcdi_functions.h"
> +#include "../cdx/controller/mcdi.h"
> +#include "../cdx/controller/mc_cdx_pcol.h"

This looks like a hack to make it work. 

Also, this seems to build too:

diff --git a/drivers/edac/versalnet_rpmsg_edac.c b/drivers/edac/versalnet_rpmsg_edac.c
index a54911f07c67..5998a677cecd 100644
--- a/drivers/edac/versalnet_rpmsg_edac.c
+++ b/drivers/edac/versalnet_rpmsg_edac.c
@@ -15,9 +15,9 @@
 
 #include "edac_module.h"
 #include "../cdx/cdx.h"
-#include "../cdx/controller/mcdi_functions.h"
+//#include "../cdx/controller/mcdi_functions.h"
 #include "../cdx/controller/mcdi.h"
-#include "../cdx/controller/mc_cdx_pcol.h"
+//#include "../cdx/controller/mc_cdx_pcol.h"
 
 /* Granularity of reported error in bytes */
 #define DDRMC5_EDAC_ERR_GRAIN			1


so it looks like you've been adding includes until it builds.

So, how about a proper export of functionality into the proper linux/
namespace like it is usually done?

> +/* Granularity of reported error in bytes */
> +#define DDRMC5_EDAC_ERR_GRAIN			1
> +#define MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN	4 /* 4 bytes */
> +
> +#define DDRMC5_EDAC_MSG_SIZE			256
> +
> +#define DDRMC5_IRQ_CE_MASK			GENMASK(18, 15)
> +#define DDRMC5_IRQ_UE_MASK			GENMASK(14, 11)
> +
> +#define DDRMC5_RANK_1_MASK			GENMASK(11, 6)
> +#define MASK_24					GENMASK(29, 24)
> +#define MASK_0					GENMASK(5, 0)
> +
> +#define DDRMC5_LRANK_1_MASK			GENMASK(11, 6)
> +#define DDRMC5_LRANK_2_MASK			GENMASK(17, 12)
> +#define DDRMC5_BANK1_MASK			GENMASK(11, 6)
> +#define DDRMC5_GRP_0_MASK			GENMASK(17, 12)
> +#define DDRMC5_GRP_1_MASK			GENMASK(23, 18)
> +
> +#define ECCR_UE_CE_ADDR_HI_ROW_MASK		GENMASK(10, 0)
> +
> +#define DDRMC5_MAX_ROW_CNT			18
> +#define DDRMC5_MAX_COL_CNT			11
> +#define DDRMC5_MAX_RANK_CNT			2
> +#define DDRMC5_MAX_LRANK_CNT			4
> +#define DDRMC5_MAX_BANK_CNT			2
> +#define DDRMC5_MAX_GRP_CNT			3
> +
> +#define DDRMC5_REGHI_ROW			7
> +#define DDRMC5_EACHBIT				1
> +#define DDRMC5_ERR_TYPE_CE			0
> +#define DDRMC5_ERR_TYPE_UE			1
> +#define DDRMC5_HIGH_MEM_EN			BIT(20)
> +#define DDRMC5_MEM_MASK				GENMASK(19, 0)
> +#define DDRMC5_X16_BASE				256
> +#define DDRMC5_X16_ECC				32
> +#define DDRMC5_X16_SIZE				(DDRMC5_X16_BASE + DDRMC5_X16_ECC)
> +#define DDRMC5_X32_SIZE				576
> +#define DDRMC5_HIMEM_BASE			(256 * SZ_1M)
> +#define DDRMC5_ILC_HIMEM_EN			BIT(28)
> +#define DDRMC5_ILC_MEM				GENMASK(27, 0)
> +#define DDRMC5_INTERLEAVE_SEL			GENMASK(3, 0)
> +#define DDRMC5_BUS_WIDTH_MASK			GENMASK(19, 18)
> +#define DDRMC5_NUM_CHANS_MASK			BIT(17)
> +#define DDRMC5_RANK_MASK			GENMASK(15, 14)
> +#define DDRMC5_DWIDTH_MASK			GENMASK(5, 4)
> +
> +#define AMD_MIN_BUF_LEN				0x28
> +#define AMD_ERROR_LEVEL				2
> +#define AMD_ERRORID				3
> +#define TOTAL_ERR_LENGTH			5
> +#define AMD_MSG_ERR_OFFSET			8
> +#define AMD_MSG_ERR_LENGTH			9
> +#define AMD_ERR_DATA				10
> +#define MCDI_RESPONSE				0xFF
> +
> +#define ERR_NOTIFICATION_MAX			96
> +#define REG_MAX					152
> +#define ADEC_MAX				152
> +#define NUM_CONTROLLERS				8
> +#define REGS_PER_CONTROLLER			19
> +#define ADEC_NUM				19
> +#define MC_CMD_EDAC_GET_OVERALL_DDR_CONFIG	2
> +#define BUFFER_SZ				80
> +
> +#define XDDR5_BUS_WIDTH_64			0
> +#define XDDR5_BUS_WIDTH_32			1
> +#define XDDR5_BUS_WIDTH_16			2
> +
> +/**
> + * struct ecc_error_info - ECC error log information.
> + * @burstpos:		Burst position.
> + * @lrank:		Logical Rank number.
> + * @rank:		Rank number.
> + * @group:		Group number.
> + * @bank:		Bank number.
> + * @col:		Column number.
> + * @row:		Row number.
> + * @rowhi:		Row number higher bits.
> + * @i:			ECC error info.
> + */
> +union ecc_error_info {
> +	struct {
> +		u32 burstpos:3;
> +		u32 lrank:4;
> +		u32 rank:2;
> +		u32 group:3;
> +		u32 bank:2;
> +		u32 col:11;
> +		u32 row:7;
> +		u32 rowhi;
> +	};
> +	u64 i;
> +} __packed;
> +
> +union edac_info {

What is an "edac_info"?

> +	struct {
> +		u32 row0:6;
> +		u32 row1:6;
> +		u32 row2:6;
> +		u32 row3:6;
> +		u32 row4:6;
> +		u32 reserved:2;
> +	};
> +	struct {
> +		u32 col1:6;
> +		u32 col2:6;
> +		u32 col3:6;
> +		u32 col4:6;
> +		u32 col5:6;
> +		u32 reservedcol:2;
> +	};
> +	u32 i;
> +} __packed;
> +
> +/**
> + * struct ack - Acknowledgment information to report.
> + * @error_level:	Error level.
> + * @error_id:		Error id ectable error log information.

Unknown word [ectable] in comment.
Suggestions: ['eatable', 'editable', 'equatable', 'equitable', 'electable', 'equable', 'educable', 'octal', 'vegetable', 'equitably', 'edible', 'uneatable', 'quotable']

Use a spellchecker please.

> + * @next_action:	Next action to be taken.
> + */
> +struct ack {
> +	u32 error_level;
> +	u32 error_id;
> +	u32 next_action;
> +};
> +
> +/**
> + * struct ecc_status - ECC status information to report.
> + * @ceinfo:	Correctable error log information.
> + * @ueinfo:	Uncorrectable error log information.
> + * @channel:	Channel number.
> + * @error_type:	Error type information.
> + */
> +struct ecc_status {
> +	union ecc_error_info ceinfo[2];
> +	union ecc_error_info ueinfo[2];
> +	u8 channel;
> +	u8 error_type;
> +};
> +
> +/**
> + * struct edac_priv - DDR memory controller private instance data.

So it pertains to the memory controller not to EDAC. So call it that.

> + * @message:		Buffer for framing the event specific info.
> + * @ce_cnt:		Correctable error count.
> + * @ue_cnt:		UnCorrectable error count.
> + * @stat:		ECC status information.
> + * @lrank_bit:		Bit shifts for lrank bit.
> + * @rank_bit:		Bit shifts for rank bit.
> + * @row_bit:		Bit shifts for row bit.
> + * @col_bit:		Bit shifts for column bit.
> + * @bank_bit:		Bit shifts for bank bit.
> + * @grp_bit:		Bit shifts for group bit.
> + * @error_id:		The error id.
> + * @error_level:	The error level.
> + * @dwidth:		The dwidth.

Say what now? What is the "dwidth"?

> + * @part_len:		The subpart of the message received.
> + * @regs:		The registers sent on the rpmsg.
> + * @adec:		Address decode registers.
> + * @mci:		Memory controller interface.
> + * @ept:		rpmsg endpoint.
> + * @mcdi:		The mcdi handle.
> + * @work:		The work queue.
> + * @xfer_done:		The completion indicator for the rpmsg.
> + */
> +struct edac_priv {
> +	char message[DDRMC5_EDAC_MSG_SIZE];
> +	u32 ce_cnt;
> +	u32 ue_cnt;
> +	struct ecc_status stat;
> +	u32 lrank_bit[4];
> +	u32 rank_bit[2];
> +	u32 row_bit[18];
> +	u32 col_bit[11];
> +	u32 bank_bit[2];
> +	u32 grp_bit[3];
> +	u32 error_id;
> +	u32 error_level;
> +	enum dev_type dwidth;
> +	u32 part_len;
> +	u32 regs[REG_MAX];
> +	u32 adec[ADEC_MAX];
> +	struct mem_ctl_info *mci;
> +	struct rpmsg_endpoint *ept;
> +	struct cdx_mcdi *mcdi;
> +	struct work_struct work;
> +	struct completion xfer_done;
> +};
> +
> +enum error_ids {

I'm not sure those are really needed: they're used once and you can just as
well use the naked numbers with comments above them in the switch-case. For
example:

	/* FPX SPLITTER error */
        case 132 ... 139:
                snprintf(edac_priv->message, DDRMC5_EDAC_MSG_SIZE,
                         "Error type: FPX SPLITTER Error %d", error_id);

and so on.

> +	BOOT_CR = 0,
> +	BOOT_NCR = 1,
> +	FW_CR = 2,
> +	FW_NCR = 3,
> +	GSW_CR = 4,
> +	GSW_NCR = 5,
> +	CFU = 6,
> +	CFRAME = 7,
> +	PSM_CR = 8,
> +	PSM_NCR = 9,
> +	DDRMC5_MB_CR = 10,
> +	DDRMC5_MB_NCR = 11,
> +	NOCTYPE_CR = 12,
> +	NOCTYPE_NCR = 13,
> +	NOCUSER = 14,
> +	MMCM = 15,
> +	HNICX_CR = 16,
> +	HNICX_NCR = 17,
> +	DDRMC5_CR = 18,
> +	DDRMC5_NCR = 19,
> +	GT_CR = 20,
> +	GT_NCR = 21,
> +	PLSYSMON_CR = 22,
> +	PLSYSMON_NCR = 23,
> +	PL0 = 24,
> +	PL1 = 25,
> +	PL2 = 26,
> +	PL3 = 27,
> +	NPI_ROOT = 28,
> +	SSIT_ERR3 = 29,
> +	SSIT_ERR4 = 30,
> +	SSIT_ERR5 = 31,
> +	PMC_APB = 32,
> +	PMC_ROM = 33,
> +	MB_FATAL0 = 34,
> +	MB_FATAL1 = 35,
> +	PMC_CR = 36,
> +	PMC_NCR = 37,
> +	PMC_SMON0 = 39,
> +	PMC_SMON1 = 40,
> +	PMC_SMON2 = 41,
> +	PMC_SMON3 = 42,
> +	PMC_SMON4 = 43,
> +	PMC_SMON8 = 47,
> +	PMC_SMON9 = 48,
> +	CFI = 49,
> +	CFRAME_SEU_CRC = 50,
> +	CFRAME_SEU_ECC = 51,
> +	PMX_WWDT = 52,
> +	RTC_ALARM = 54,
> +	NPLL = 55,
> +	PPLL = 56,
> +	CLK_MON = 57,
> +	INT_PMX_CORR_ERR = 59,
> +	INT_PMX_UNCORR_ERR = 60,
> +	SSIT_ERR0 = 61,
> +	SSIT_ERR1 = 62,
> +	SSIT_ERR2 = 63,
> +	IOU_CR = 64,
> +	IOU_NCR = 65,
> +	DFX_UXPT_ACT = 66,
> +	DICE_CDI_PAR = 67,
> +	DEVIK_PRIV = 68,
> +	NXTSW_CDI_PAR = 69,
> +	DEVAK_PRIV = 70,
> +	DME_PUB_X = 71,
> +	DME_PUB_Y = 72,
> +	DEVAK_PUB_X = 73,
> +	DEVAK_PUB_Y = 74,
> +	DEVIK_PUB_X = 75,
> +	DEVIK_PUB_Y = 76,
> +	PCR_PAR = 77,
> +	PS_SW_CR = 96,
> +	PS_SW_NCR = 97,
> +	PSM_B_CR = 98,
> +	PSM_B_NCR = 99,
> +	MB_FATAL = 100,
> +	PSMX_CHK = 103,
> +	APLL1_LOCK = 104,
> +	APLL2_LOCK = 105,
> +	RPLL_LOCK = 106,
> +	FLXPLL_LOCK = 107,
> +	INT_PSM_CR = 108,
> +	INT_PSM_NCR = 109,
> +	USB_ERR = 110,
> +	LPX_DFX = 111,
> +	INT_LPD_CR = 113,
> +	INT_LPD_NCR = 114,
> +	INT_OCM_CR = 115,
> +	INT_OCM_NCR = 116,
> +	INT_FPD_CR = 117,
> +	INT_FPD_NCR = 118,
> +	INT_IOU_CR = 119,
> +	INT_IOU_NCR = 120,
> +	RPU_CLUSTERA_LS = 121,
> +	RPU_CLUSTERB_LS = 122,
> +	GIC_AXI = 123,
> +	GIC_ECC = 124,
> +	CPM_CR = 125,
> +	CPM_NCR = 126,
> +	FPD_CPI = 127,
> +	FPD_SWDT0 = 128,
> +	FPD_SWDT1 = 129,
> +	FPD_SWDT2 = 130,
> +	FPD_SWDT3 = 131,
> +	FPX_SPLITTER0_MEM_ERR = 132,
> +	FPX_SPLITTER0_AXI_ERR = 133,
> +	FPX_SPLITTER1_MEM_ERR = 134,
> +	FPX_SPLITTER1_AXI_ERR = 135,
> +	FPX_SPLITTER2_MEM_ERR = 136,
> +	FPX_SPLITTER2_AXI_ERR = 137,
> +	FPX_SPLITTER3_MEM_ERR = 138,
> +	FPX_SPLITTER3_AXI_ERR = 139,
> +	APU0 = 140,
> +	APU1 = 141,
> +	APU2 = 142,
> +	APU3 = 143,
> +	LPX_WWDT0 = 144,
> +	LPX_WWDT1 = 145,
> +	ADMA_LS_ERR = 146,
> +	IPI_ERR = 147,
> +	OCM0_CORR_ERR = 148,
> +	OCM1_CORR_ERR = 149,
> +	OCM0_UNCORR_ERR = 150,
> +	OCM1_UNCORR_ERR = 151,
> +	LPX_AFIFS_CORR_ERR = 152,
> +	LPX_AFIFS_UNCORR_ERR = 153,
> +	LPX_GLITCH_DET0 = 154,
> +	LPX_GLITCH_DET1 = 155,
> +	NOC_NMU_FIREWALL_WR_ERR = 156,
> +	NOC_NMU_FIREWALL_RD_ERR = 157,
> +	NOC_NSU_FIREWALL_ERR = 158,
> +	RPUA0_CORR_ERR = 163,
> +	RPUA0_MISC1 = 166,
> +	RPUA0_MISC2 = 167,
> +	RPUA1_CORR_ERR = 168,
> +	RPUA1_FATAL_ERR = 169,
> +	RPUA1_TIMEOUT_ERR = 170,
> +	RPUA1_MISC1 = 171,
> +	RPUA1_MISC2 = 172,
> +	RPUB0_CORR_ERR = 173,
> +	RPUB0_FATAL_ERR = 174,
> +	RPUB0_TIMEOUT_ERR = 175,
> +	RPUB0_MISC1 = 176,
> +	RPUB0_MISC2 = 177,
> +	RPUB1_CORR_ERR = 178,
> +	RPUB1_FATAL_ERR = 179,
> +	RPUB1_TIMEOUT_ERR = 180,
> +	RPUB1_MISC1 = 181,
> +	RPUB1_MISC2 = 182,
> +	RPU_PCIL_ERR = 184,
> +	FPX_AFIFS_CORR_ERR = 185,
> +	FPX_AFIFS_UNCORR_ERR = 186,
> +	FPD_CMN_1_ERR = 187,
> +	FPD_CMN_2_ERR = 188,
> +	FPD_CMN_3_ERR = 189,
> +	FPD_CML_ERR = 190,
> +	FPD_INTWRAP_ERR = 191,
> +	FPD_REST_MON_ERR = 192,
> +	LPD_MON_ERR = 193,
> +	AFIFM_FATAL_ERR = 194,
> +	LPX_AFIFM_NONFATAL_ERR = 195,
> +	FPX_AFIFM0_NONFATAL_ERR = 196,
> +	FPX_AFIFM1_NONFATAL_ERR = 197,
> +	FPX_AFIFM2_NONFATAL_ERR = 198,
> +	FPX_AFIFM3_NONFATAL_ERR = 199,
> +	RPU_CLUSTERA_ERR = 200,
> +	RPU_CLUSTERB_ERR = 201,
> +	HB_MON_0 = 224,
> +	HB_MON_1 = 225,
> +	HB_MON_2 = 226,
> +	HB_MON_3 = 227,
> +	PLM_EXCEPTION = 228,
> +	PCR_LOG_UPDATE = 230,
> +	ERROR_CRAM_CE = 231,
> +	ERROR_CRAM_UE = 232,
> +	ERROR_NPI_UE = 233,
> +};
> +
> +enum adec_info {

What is "ADEC"? Some permutaion of "EDAC"?

> +	CONF = 0,
> +	ADEC0,
> +	ADEC1,
> +	ADEC2,
> +	ADEC3,
> +	ADEC4,
> +	ADEC5,
> +	ADEC6,
> +	ADEC7,
> +	ADEC8,
> +	ADEC9,
> +	ADEC10,
> +	ADEC11,
> +	ADEC12,
> +	ADEC13,
> +	ADEC14,
> +	ADEC15,
> +	ADEC16,
> +	ADECILC,
> +};
> +
> +enum reg_info {
> +	ISR = 0,
> +	IMR,
> +	ECCR0_ERR_STATUS,
> +	ECCR0_ADDR_LO,
> +	ECCR0_ADDR_HI,
> +	ECCR0_DATA_LO,
> +	ECCR0_DATA_HI,
> +	ECCR0_PAR,
> +	ECCR1_ERR_STATUS,
> +	ECCR1_ADDR_LO,
> +	ECCR1_ADDR_HI,
> +	ECCR1_DATA_LO,
> +	ECCR1_DATA_HI,
> +	ECCR1_PAR,
> +	XMPU_ERR,
> +	XMPU_ERR_ADDR_L0,
> +	XMPU_ERR_ADDR_HI,
> +	XMPU_ERR_AXI_ID,
> +	ADEC_CHK_ERR_LOG,
> +};
> +
> +static void get_ddr_ce_error_info(u32 *error_data, struct edac_priv *priv)
> +{
> +	u32 regval, par, reghi;

"parity" is still short enough but a lot more understandable.

> +	struct ecc_status *p;
> +
> +	p = &priv->stat;
> +
> +	regval = error_data[ECCR0_ADDR_HI];
> +	reghi = regval & ECCR_UE_CE_ADDR_HI_ROW_MASK;

So "regval" should be "reglo"?

> +	regval = error_data[ECCR0_ADDR_LO];
> +	p->ceinfo[0].i = regval | (u64)reghi << 32;
> +
> +	par = error_data[ECCR0_PAR];
> +	edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n",
> +		 reghi, regval, par);
> +
> +	regval = error_data[ECCR1_ADDR_LO];
> +	reghi = error_data[ECCR1_ADDR_HI];
> +	p->ceinfo[1].i = regval | (u64)reghi << 32;
> +
> +	par = error_data[ECCR1_PAR];
> +	edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n",
> +		 reghi, regval, par);
> +}
> +
> +static void get_ddr_ue_error_info(u32 error_data[REGS_PER_CONTROLLER], struct edac_priv *priv)
				     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What is that for?

> +{
> +	u32 regval, par, reghi;
> +	struct ecc_status *p;
> +
> +	p = &priv->stat;
> +
> +	regval = error_data[ECCR0_ADDR_LO];
> +	reghi = error_data[ECCR0_ADDR_HI];
> +	par = error_data[ECCR0_PAR];
> +
> +	p->ueinfo[0].i = regval | (u64)reghi << 32;
> +
> +	edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n",
> +		 reghi, regval, par);
> +
> +	regval = error_data[ECCR1_ADDR_LO];
> +	reghi = error_data[ECCR1_ADDR_HI];
> +	p->ueinfo[1].i = regval | (u64)reghi << 32;
> +	par = error_data[ECCR1_PAR];
> +
> +	edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n",
> +		 reghi, regval, par);
> +}

Same comments as for get_ddr_ce_error_info().

Looking at those functions, you can actually merge them into one.

> +static bool get_ddr_ue_info(u32 error_data[REGS_PER_CONTROLLER], struct edac_priv *priv)
> +{
> +	u32 eccr0_val, eccr1_val, isr;
> +	struct ecc_status *p;
> +
> +	p = &priv->stat;
> +
> +	isr = error_data[ISR];
> +	if (!(isr & DDRMC5_IRQ_UE_MASK))
> +		return false;
> +
> +	eccr0_val = error_data[ECCR0_ERR_STATUS];
> +	eccr1_val = error_data[ECCR1_ERR_STATUS];
> +
> +	if (!eccr0_val && !eccr1_val)
> +		return false;
> +
> +	if (!eccr0_val)
> +		p->channel = 1;
> +	else
> +		p->channel = 0;
> +
> +	get_ddr_ue_error_info(error_data, priv);
> +
> +	return true;
> +}
> +
> +static bool get_ddr_ce_info(u32 error_data[REGS_PER_CONTROLLER], struct edac_priv *priv)
> +{
> +	u32 eccr0_val, eccr1_val, isr;
> +	struct ecc_status *p;
> +
> +	p = &priv->stat;
> +
> +	isr = error_data[ISR];
> +	if (!(isr & DDRMC5_IRQ_CE_MASK))
> +		return false;
> +
> +	eccr0_val = error_data[ECCR0_ERR_STATUS];
> +	eccr1_val = error_data[ECCR1_ERR_STATUS];
> +
> +	if (!eccr0_val && !eccr1_val)
> +		return false;
> +
> +	if (!eccr0_val)
> +		p->channel = 1;
> +	else
> +		p->channel = 0;
> +
> +	get_ddr_ce_error_info(error_data, priv);
> +
> +	return true;
> +}

Also unify into a single function. So basically the above 4 could be a single
function.

> +
> +/**
> + * handle_error - Handle Correctable and Uncorrectable errors.
> + * @priv:	DDR memory controller private instance data.
> + * @stat:	ECC status structure.
> + * @controller:	Controller number of the DDRMC5
> + *
> + * Handles ECC correctable and uncorrectable errors.
> + */
> +static void handle_error(struct edac_priv  *priv, struct ecc_status *stat, int controller)
> +{
> +	struct mem_ctl_info *mci = priv->mci;
> +	union ecc_error_info pinf;
> +	unsigned long pa;
> +	phys_addr_t pfn;
> +	int err;
> +
> +	if (stat->error_type == DDRMC5_ERR_TYPE_CE) {
> +		priv->ce_cnt++;
> +		pinf = stat->ceinfo[stat->channel];
> +		snprintf(priv->message, DDRMC5_EDAC_MSG_SIZE,
> +			 "Error type:%s Addr at %lx\n",
> +			 "CE", convert_to_physical(priv, pinf, controller));
> +
> +		edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
> +				     1, 0, 0, 0, 0, 0, -1,
> +				     priv->message, "");
> +	}
> +
> +	if (stat->error_type == DDRMC5_ERR_TYPE_UE) {
> +		priv->ue_cnt++;
> +		pinf = stat->ueinfo[stat->channel];
> +		snprintf(priv->message, DDRMC5_EDAC_MSG_SIZE,
> +			 "Error type:%s Addr at %lx\n",
> +			 "UE", convert_to_physical(priv, pinf, controller));
> +
> +		edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
> +				     1, 0, 0, 0, 0, 0, -1,
> +				     priv->message, "");
> +		pa = convert_to_physical(priv, pinf, controller);
> +		pfn = PHYS_PFN(pa);
> +
> +		if (IS_ENABLED(CONFIG_MEMORY_FAILURE)) {
> +			err = memory_failure(pfn, MF_ACTION_REQUIRED);
> +			if (err)
> +				edac_dbg(2, "In fail of memory_failure %d\n", err);
> +			else
> +				edac_dbg(2, "Page at PA 0x%lx is hardware poisoned\n", pa);
> +		}
> +	}
> +
> +	memset(stat, 0, sizeof(*stat));

This is the wrong ordering - you either clear it on entry or the caller should
clear it - the caller should not rely on this function clearing it for the
next function it is going to call.

> +
> +/**
> + * mc_init - Initialize one driver instance.
> + * @mci:	EDAC memory controller instance.
> + * @pdev:	platform device.
> + *
> + * Perform initialization of the EDAC memory controller instance and
> + * related driver-private data associated with the memory controller the
> + * instance is bound to.

The stuff which needs commenting doesn't have any but the obvious ones have
comments which are more than necessary. :-\

> + */
> +static void mc_init(struct mem_ctl_info *mci, struct platform_device *pdev)
> +{
> +	mci->pdev = &pdev->dev;
> +	platform_set_drvdata(pdev, mci);
> +
> +	/* Initialize controller capabilities and configuration */
> +	mci->mtype_cap = MEM_FLAG_DDR5;
> +	mci->edac_ctl_cap = EDAC_FLAG_NONE | EDAC_FLAG_SECDED;
> +	mci->scrub_cap = SCRUB_HW_SRC;
> +	mci->scrub_mode = SCRUB_NONE;
> +
> +	mci->edac_cap = EDAC_FLAG_SECDED;
> +	mci->ctl_name = "amd_ddr_controller";
> +	mci->dev_name = dev_name(&pdev->dev);
> +	mci->mod_name = "amd_edac";

Do:

git grep mod_name drivers/edac/

to get an idea how those names are chosen.

> +	edac_op_state = EDAC_OPSTATE_INT;
> +
> +	init_csrows(mci);
> +}
> +
> +#define to_mci(k) container_of(k, struct mem_ctl_info, dev)
> +
> +static int amd_rpmsg_send(struct cdx_mcdi *cdx_mcdi,
> +			  const struct cdx_dword *hdr, size_t hdr_len,
> +			  const struct cdx_dword *sdu, size_t sdu_len)

Used only once - fold it into the call site. And static functions don't need
silly name prefixes like "amd_".

> +{
> +	unsigned char *send_buf;
> +	int ret;
> +
> +	send_buf = kzalloc(hdr_len + sdu_len, GFP_KERNEL);
> +	if (!send_buf)
> +		return -ENOMEM;
> +
> +	memcpy(send_buf, hdr, hdr_len);
> +	memcpy(send_buf + hdr_len, sdu, sdu_len);
> +
> +	ret = rpmsg_send(cdx_mcdi->ept, send_buf, hdr_len + sdu_len);
> +	kfree(send_buf);
> +	return ret;
> +}
> +



> +static int get_ddr_config(u32 index, u32 *buffer, struct cdx_mcdi *amd_mcdi)
> +{
> +	size_t outlen;
> +	int ret;
> +
> +	MCDI_DECLARE_BUF(inbuf, MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN);
> +	MCDI_DECLARE_BUF(outbuf, BUFFER_SZ);
> +
> +	MCDI_SET_DWORD(inbuf, EDAC_GET_DDR_CONFIG_IN_CONTROLLER_INDEX, index);
> +
> +	ret = cdx_mcdi_rpc(amd_mcdi, MC_CMD_EDAC_GET_DDR_CONFIG, inbuf, sizeof(inbuf),
> +			   outbuf, sizeof(outbuf), &outlen);
> +	if (ret)
> +		return ret;
> +	memcpy(buffer, MCDI_PTR(outbuf, EDAC_GET_DDR_CONFIG_OUT_REGISTER_VALUES), (ADEC_NUM * 4));
> +	return 0;

Function returning a value which no one uses.

> +}
> +
> +static int amd_setup_mcdi(struct edac_priv *edac_priv)
> +{
> +	struct cdx_mcdi *amd_mcdi;
> +	int ret, i;
> +
> +	amd_mcdi = kzalloc(sizeof(*amd_mcdi), GFP_KERNEL);
> +	if (!amd_mcdi)
> +		return -ENOMEM;
> +
> +	/* Store the MCDI ops */

Useless comment.

> +	amd_mcdi->mcdi_ops = &mcdi_ops;
> +	/* MCDI FW: Initialize the FW path */

Ditto.

> +	ret = cdx_mcdi_init(amd_mcdi);
> +	if (ret)
> +		return ret;

And here you leaked amd_mcdi when you returned.

> +	amd_mcdi->ept = edac_priv->ept;
> +	edac_priv->mcdi = amd_mcdi;
> +
> +	for (i = 0; i < NUM_CONTROLLERS; i++)
> +		get_ddr_config(i, &edac_priv->adec[ADEC_NUM * i], amd_mcdi);
> +
> +	complete(&edac_priv->xfer_done);

That looks like a hack.

> +	return 0;

Ditto.

> +}
> +
> +static inline void process_bit(struct edac_priv *priv, unsigned int start, u32 regval)

You don't need "inline" - the compiler can decide that itself. And
"process_bit" needs a better name.

> +{
> +	union edac_info rows;
> +
> +	rows.i = regval;
> +	priv->row_bit[start] = rows.row0;
> +	priv->row_bit[start + 1] = rows.row1;
> +	priv->row_bit[start + 2] = rows.row2;
> +	priv->row_bit[start + 3] = rows.row3;
> +	priv->row_bit[start + 4] = rows.row4;
> +}
> +
> +static void setup_row_address_map(struct edac_priv *priv, u32 *error_data)
> +{
> +	union edac_info rows;
> +	u32 regval;
> +
> +	regval = error_data[ADEC6];
> +	process_bit(priv, 0, regval);
> +
> +	regval = error_data[ADEC7];
> +	process_bit(priv, 5, regval);
> +
> +	regval = error_data[ADEC8];
> +	process_bit(priv, 10, regval);
> +
> +	regval = error_data[ADEC9];
> +	rows.i = regval;
> +
> +	priv->row_bit[15] = rows.row0;
> +	priv->row_bit[16] = rows.row1;
> +	priv->row_bit[17] = rows.row2;
> +}
> +
> +static void setup_column_address_map(struct edac_priv *priv, u32 *error_data)
> +{
> +	union edac_info cols;
> +	u32 regval;
> +
> +	regval = error_data[ADEC9];
> +	priv->col_bit[0] = FIELD_GET(MASK_24, regval);
> +
> +	regval = error_data[ADEC10];
> +	cols.i = regval;
> +	priv->col_bit[1] = cols.col1;
> +	priv->col_bit[2] = cols.col2;
> +	priv->col_bit[3] = cols.col3;
> +	priv->col_bit[4] = cols.col4;
> +	priv->col_bit[5] = cols.col5;
> +
> +	regval = error_data[ADEC11];
> +	cols.i = regval;
> +	priv->col_bit[6] = cols.col1;
> +	priv->col_bit[7] = cols.col2;
> +	priv->col_bit[8] = cols.col3;
> +	priv->col_bit[9] = cols.col4;
> +	priv->col_bit[10] = cols.col5;
> +}

Why are those functions copying stuff around? Why aren't you using cols
directly?

> +static inline bool is_response(u8 *data)
> +{
> +	return (data[0] == MCDI_RESPONSE);
> +}

Zap that silly function.

> +
> +static struct rpmsg_device_id amd_rpmsg_id_table[] = {
> +	{ .name = "error_ipc" },
> +	{ },
> +};
> +MODULE_DEVICE_TABLE(rpmsg, amd_rpmsg_id_table);
> +
> +static void amd_rpmsg_post_probe_work(struct work_struct *work)
> +{
> +	struct edac_priv *priv;
> +
> +	priv = container_of(work, struct edac_priv, work);
> +	amd_setup_mcdi(priv);
> +}

Why is probing a work item?

Explaining *that* is what a commit message is for - not for repeating useless
info.

> +static int amd_rpmsg_probe(struct rpmsg_device *rpdev)
> +{
> +	struct rpmsg_channel_info chinfo = {0};
> +	struct edac_priv *pg;
> +
> +	pg = (struct edac_priv *)amd_rpmsg_id_table[0].driver_data;
> +	chinfo.src = RPMSG_ADDR_ANY;
> +	chinfo.dst = rpdev->dst; /* NMC */

verify_comment_style: WARNING: No tail comments please:
 drivers/edac/versalnet_rpmsg_edac.c:1139 [+	chinfo.dst = rpdev->dst; /* NMC */]

Check your whole driver.

> +	strscpy(chinfo.name, amd_rpmsg_id_table[0].name,
> +		strlen(amd_rpmsg_id_table[0].name));
> +
> +	pg->ept = rpmsg_create_ept(rpdev, amd_rpmsg_cb, NULL, chinfo);
> +	if (!pg->ept)
> +		return dev_err_probe(&rpdev->dev, -ENXIO,
> +			      "Failed to create ept for channel %s\n",
> +			      chinfo.name);
> +
> +	dev_set_drvdata(&rpdev->dev, pg);
> +
> +	schedule_work(&pg->work);
> +
> +	return 0;
> +}
> +

> +static int mc_probe(struct platform_device *pdev)
> +{
> +	unsigned long time_left, wait_jiffies;
> +	u32 num_chans, rank, dwidth, config;
> +	struct device_node *r5_core_node;
> +	struct edac_mc_layer layers[2];
> +	struct mem_ctl_info *mci;
> +	struct edac_priv *priv;
> +	struct rproc *rp;
> +	enum dev_type dt;
> +	int rc, i;
> +
> +	r5_core_node = of_parse_phandle(pdev->dev.of_node, "amd,rproc", 0);
> +	if (!r5_core_node) {
> +		dev_err(&pdev->dev, "amd,rproc: invalid phandle\n");
> +		return -EINVAL;
> +	}
> +
> +	rp = rproc_get_by_phandle(r5_core_node->phandle);
> +	if (!rp)
> +		return -EPROBE_DEFER;
> +
> +	rc = rproc_boot(rp);
> +	if (rc) {
> +		dev_err(&pdev->dev, "Failed to attach to remote processor\n");
> +		rproc_put(rp);
> +		return rc;
> +	}
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	amd_rpmsg_id_table[0].driver_data = (kernel_ulong_t)priv;
> +	INIT_WORK(&priv->work, amd_rpmsg_post_probe_work);
> +	init_completion(&priv->xfer_done);
> +	rc = register_rpmsg_driver(&amd_rpmsg_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MC,
> +			    "Failed to register RPMsg driver: %d\n", rc);
> +		goto free_rproc;
> +	}
> +	wait_jiffies = msecs_to_jiffies(10000);
> +	time_left = wait_for_completion_timeout(&priv->xfer_done, wait_jiffies);
> +	if (time_left == 0)
> +		goto free_rpmsg;

Yah, this needs explanation as to why it is there.

> +	for (i = 0; i < NUM_CONTROLLERS; i++) {
> +		config = priv->adec[CONF + i];
> +		num_chans = FIELD_GET(DDRMC5_NUM_CHANS_MASK, config);
> +		rank = FIELD_GET(DDRMC5_RANK_MASK, config);
> +		rank = 1 << rank;
> +		dwidth = FIELD_GET(DDRMC5_BUS_WIDTH_MASK, config);
> +		dt = get_dwidth(dwidth);
> +		if (dt != DEV_UNKNOWN)
> +			break;
> +	}

What is that loop supposed to do? Find the last controller before the one
with DEV_UNKNOWN device width and register that one?

> +	layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
> +	layers[0].size = rank;
> +	layers[0].is_virt_csrow = true;
> +	layers[1].type = EDAC_MC_LAYER_CHANNEL;
> +	layers[1].size = num_chans;
> +	layers[1].is_virt_csrow = false;
> +
> +	mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers,
> +			    sizeof(struct edac_priv));
> +	if (!mci) {
> +		edac_printk(KERN_ERR, EDAC_MC,
> +			    "Failed memory allocation for mc instance\n");
> +		rc = -ENOMEM;
> +		goto free_rproc;
> +	}
> +	priv->mci = mci;
> +
> +	priv->dwidth = dt;
> +	mc_init(mci, pdev);
> +	rc = edac_mc_add_mc(mci);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MC,
> +			    "Failed to register with EDAC core\n");
> +		goto free_edac_mc;
> +	}

Yeah, in any case, this needs a lot more explanation how all the parts are
supposed to work together.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette