On Mon, Jan 06, 2025 at 11:03:58AM +0530, Shubhrajyoti Datta wrote: > Subject: Re: [PATCH v5 5/5] EDAC: Versal NET: Add support for error notification Do a git log drivers/edac/ to get an idea how the subject format for that subsystem should be. > The Versal NET edac listens to the notifications from NMC(Network s/edac/EDAC/g Check all your patches. > management controller) on rpmsg. I'm lucky Michal explains to me on internal chat what this all means. Please write proper english and explain what "rpmsg" is. > The driver registers on the error_edac channel. What is "the error_edac channel"? > Send a RAS event trace upon a notification. On reading the notification the > user space application can decide on the response. A sysfs reset entry is > created for the same that sends an acknowledgment back to the NMC. It says below "- remove reset". Please rewrite this commit message properly - it is not write-only. Use LLM AI for the formulations. > For reporting events register to the RAS framework. For memory mc events are > used other events use non-standard events. This basically explains what the patch does - I'd like to read why this patch exists. > Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xxxxxxx> > --- > > Changes in v5: > Update the compatible > Update the handle_error documentation > > Changes in v4: > Update the compatible > > Changes in v3: > make remove void > > Changes in v2: > - remove reset > - Add the remote proc requests > - remove probe_once > - reorder the rpmsg registration > - the data width , rank and number of channel is read from message. > > drivers/edac/Kconfig | 9 + > drivers/edac/Makefile | 1 + > drivers/edac/versalnet_rpmsg_edac.c | 1325 +++++++++++++++++++++++++++ > 3 files changed, 1335 insertions(+) > create mode 100644 drivers/edac/versalnet_rpmsg_edac.c Why is this thing called "versalnet_rpmsg_edac.c"? Why not simply "versalnet_edac.c"? > diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig > index 06f7b43a6f78..4241936a8e7a 100644 > --- a/drivers/edac/Kconfig > +++ b/drivers/edac/Kconfig > @@ -546,5 +546,14 @@ config EDAC_VERSAL > Support injecting both correctable and uncorrectable errors > for debugging purposes. > > +config EDAC_VERSALNET > + tristate "AMD Versal NET EDAC" > + depends on CDX_CONTROLLER && ARCH_ZYNQMP > + help > + Support for error detection and correction on the AMD Versal NET DDR > + memory controller. > + > + The memory controller supports single bit error correction, double bit > + error detection. > Report various errors to the userspace. That is superfluous - this text should only talk about your hw - not what Linux does. > endif # EDAC > diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile > index f9cf19d8d13d..a8328522f9c3 100644 > --- a/drivers/edac/Makefile > +++ b/drivers/edac/Makefile > @@ -86,3 +86,4 @@ obj-$(CONFIG_EDAC_DMC520) += dmc520_edac.o > obj-$(CONFIG_EDAC_NPCM) += npcm_edac.o > obj-$(CONFIG_EDAC_ZYNQMP) += zynqmp_edac.o > obj-$(CONFIG_EDAC_VERSAL) += versal_edac.o > +obj-$(CONFIG_EDAC_VERSALNET) += versalnet_rpmsg_edac.o > diff --git a/drivers/edac/versalnet_rpmsg_edac.c b/drivers/edac/versalnet_rpmsg_edac.c > new file mode 100644 > index 000000000000..a54911f07c67 > --- /dev/null > +++ b/drivers/edac/versalnet_rpmsg_edac.c > @@ -0,0 +1,1325 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * AMD Versal NET memory controller driver > + * Copyright (C) 2024 Advanced Micro Devices, Inc. > + */ > + > +#include <linux/edac.h> > +#include <linux/module.h> > +#include <linux/of_device.h> > +#include <linux/ras.h> > +#include <linux/rpmsg.h> > +#include <linux/sizes.h> > +#include <ras/ras_event.h> > +#include <linux/remoteproc.h> > + > +#include "edac_module.h" > +#include "../cdx/cdx.h" > +#include "../cdx/controller/mcdi_functions.h" > +#include "../cdx/controller/mcdi.h" > +#include "../cdx/controller/mc_cdx_pcol.h" This looks like a hack to make it work. Also, this seems to build too: diff --git a/drivers/edac/versalnet_rpmsg_edac.c b/drivers/edac/versalnet_rpmsg_edac.c index a54911f07c67..5998a677cecd 100644 --- a/drivers/edac/versalnet_rpmsg_edac.c +++ b/drivers/edac/versalnet_rpmsg_edac.c @@ -15,9 +15,9 @@ #include "edac_module.h" #include "../cdx/cdx.h" -#include "../cdx/controller/mcdi_functions.h" +//#include "../cdx/controller/mcdi_functions.h" #include "../cdx/controller/mcdi.h" -#include "../cdx/controller/mc_cdx_pcol.h" +//#include "../cdx/controller/mc_cdx_pcol.h" /* Granularity of reported error in bytes */ #define DDRMC5_EDAC_ERR_GRAIN 1 so it looks like you've been adding includes until it builds. So, how about a proper export of functionality into the proper linux/ namespace like it is usually done? > +/* Granularity of reported error in bytes */ > +#define DDRMC5_EDAC_ERR_GRAIN 1 > +#define MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN 4 /* 4 bytes */ > + > +#define DDRMC5_EDAC_MSG_SIZE 256 > + > +#define DDRMC5_IRQ_CE_MASK GENMASK(18, 15) > +#define DDRMC5_IRQ_UE_MASK GENMASK(14, 11) > + > +#define DDRMC5_RANK_1_MASK GENMASK(11, 6) > +#define MASK_24 GENMASK(29, 24) > +#define MASK_0 GENMASK(5, 0) > + > +#define DDRMC5_LRANK_1_MASK GENMASK(11, 6) > +#define DDRMC5_LRANK_2_MASK GENMASK(17, 12) > +#define DDRMC5_BANK1_MASK GENMASK(11, 6) > +#define DDRMC5_GRP_0_MASK GENMASK(17, 12) > +#define DDRMC5_GRP_1_MASK GENMASK(23, 18) > + > +#define ECCR_UE_CE_ADDR_HI_ROW_MASK GENMASK(10, 0) > + > +#define DDRMC5_MAX_ROW_CNT 18 > +#define DDRMC5_MAX_COL_CNT 11 > +#define DDRMC5_MAX_RANK_CNT 2 > +#define DDRMC5_MAX_LRANK_CNT 4 > +#define DDRMC5_MAX_BANK_CNT 2 > +#define DDRMC5_MAX_GRP_CNT 3 > + > +#define DDRMC5_REGHI_ROW 7 > +#define DDRMC5_EACHBIT 1 > +#define DDRMC5_ERR_TYPE_CE 0 > +#define DDRMC5_ERR_TYPE_UE 1 > +#define DDRMC5_HIGH_MEM_EN BIT(20) > +#define DDRMC5_MEM_MASK GENMASK(19, 0) > +#define DDRMC5_X16_BASE 256 > +#define DDRMC5_X16_ECC 32 > +#define DDRMC5_X16_SIZE (DDRMC5_X16_BASE + DDRMC5_X16_ECC) > +#define DDRMC5_X32_SIZE 576 > +#define DDRMC5_HIMEM_BASE (256 * SZ_1M) > +#define DDRMC5_ILC_HIMEM_EN BIT(28) > +#define DDRMC5_ILC_MEM GENMASK(27, 0) > +#define DDRMC5_INTERLEAVE_SEL GENMASK(3, 0) > +#define DDRMC5_BUS_WIDTH_MASK GENMASK(19, 18) > +#define DDRMC5_NUM_CHANS_MASK BIT(17) > +#define DDRMC5_RANK_MASK GENMASK(15, 14) > +#define DDRMC5_DWIDTH_MASK GENMASK(5, 4) > + > +#define AMD_MIN_BUF_LEN 0x28 > +#define AMD_ERROR_LEVEL 2 > +#define AMD_ERRORID 3 > +#define TOTAL_ERR_LENGTH 5 > +#define AMD_MSG_ERR_OFFSET 8 > +#define AMD_MSG_ERR_LENGTH 9 > +#define AMD_ERR_DATA 10 > +#define MCDI_RESPONSE 0xFF > + > +#define ERR_NOTIFICATION_MAX 96 > +#define REG_MAX 152 > +#define ADEC_MAX 152 > +#define NUM_CONTROLLERS 8 > +#define REGS_PER_CONTROLLER 19 > +#define ADEC_NUM 19 > +#define MC_CMD_EDAC_GET_OVERALL_DDR_CONFIG 2 > +#define BUFFER_SZ 80 > + > +#define XDDR5_BUS_WIDTH_64 0 > +#define XDDR5_BUS_WIDTH_32 1 > +#define XDDR5_BUS_WIDTH_16 2 > + > +/** > + * struct ecc_error_info - ECC error log information. > + * @burstpos: Burst position. > + * @lrank: Logical Rank number. > + * @rank: Rank number. > + * @group: Group number. > + * @bank: Bank number. > + * @col: Column number. > + * @row: Row number. > + * @rowhi: Row number higher bits. > + * @i: ECC error info. > + */ > +union ecc_error_info { > + struct { > + u32 burstpos:3; > + u32 lrank:4; > + u32 rank:2; > + u32 group:3; > + u32 bank:2; > + u32 col:11; > + u32 row:7; > + u32 rowhi; > + }; > + u64 i; > +} __packed; > + > +union edac_info { What is an "edac_info"? > + struct { > + u32 row0:6; > + u32 row1:6; > + u32 row2:6; > + u32 row3:6; > + u32 row4:6; > + u32 reserved:2; > + }; > + struct { > + u32 col1:6; > + u32 col2:6; > + u32 col3:6; > + u32 col4:6; > + u32 col5:6; > + u32 reservedcol:2; > + }; > + u32 i; > +} __packed; > + > +/** > + * struct ack - Acknowledgment information to report. > + * @error_level: Error level. > + * @error_id: Error id ectable error log information. Unknown word [ectable] in comment. Suggestions: ['eatable', 'editable', 'equatable', 'equitable', 'electable', 'equable', 'educable', 'octal', 'vegetable', 'equitably', 'edible', 'uneatable', 'quotable'] Use a spellchecker please. > + * @next_action: Next action to be taken. > + */ > +struct ack { > + u32 error_level; > + u32 error_id; > + u32 next_action; > +}; > + > +/** > + * struct ecc_status - ECC status information to report. > + * @ceinfo: Correctable error log information. > + * @ueinfo: Uncorrectable error log information. > + * @channel: Channel number. > + * @error_type: Error type information. > + */ > +struct ecc_status { > + union ecc_error_info ceinfo[2]; > + union ecc_error_info ueinfo[2]; > + u8 channel; > + u8 error_type; > +}; > + > +/** > + * struct edac_priv - DDR memory controller private instance data. So it pertains to the memory controller not to EDAC. So call it that. > + * @message: Buffer for framing the event specific info. > + * @ce_cnt: Correctable error count. > + * @ue_cnt: UnCorrectable error count. > + * @stat: ECC status information. > + * @lrank_bit: Bit shifts for lrank bit. > + * @rank_bit: Bit shifts for rank bit. > + * @row_bit: Bit shifts for row bit. > + * @col_bit: Bit shifts for column bit. > + * @bank_bit: Bit shifts for bank bit. > + * @grp_bit: Bit shifts for group bit. > + * @error_id: The error id. > + * @error_level: The error level. > + * @dwidth: The dwidth. Say what now? What is the "dwidth"? > + * @part_len: The subpart of the message received. > + * @regs: The registers sent on the rpmsg. > + * @adec: Address decode registers. > + * @mci: Memory controller interface. > + * @ept: rpmsg endpoint. > + * @mcdi: The mcdi handle. > + * @work: The work queue. > + * @xfer_done: The completion indicator for the rpmsg. > + */ > +struct edac_priv { > + char message[DDRMC5_EDAC_MSG_SIZE]; > + u32 ce_cnt; > + u32 ue_cnt; > + struct ecc_status stat; > + u32 lrank_bit[4]; > + u32 rank_bit[2]; > + u32 row_bit[18]; > + u32 col_bit[11]; > + u32 bank_bit[2]; > + u32 grp_bit[3]; > + u32 error_id; > + u32 error_level; > + enum dev_type dwidth; > + u32 part_len; > + u32 regs[REG_MAX]; > + u32 adec[ADEC_MAX]; > + struct mem_ctl_info *mci; > + struct rpmsg_endpoint *ept; > + struct cdx_mcdi *mcdi; > + struct work_struct work; > + struct completion xfer_done; > +}; > + > +enum error_ids { I'm not sure those are really needed: they're used once and you can just as well use the naked numbers with comments above them in the switch-case. For example: /* FPX SPLITTER error */ case 132 ... 139: snprintf(edac_priv->message, DDRMC5_EDAC_MSG_SIZE, "Error type: FPX SPLITTER Error %d", error_id); and so on. > + BOOT_CR = 0, > + BOOT_NCR = 1, > + FW_CR = 2, > + FW_NCR = 3, > + GSW_CR = 4, > + GSW_NCR = 5, > + CFU = 6, > + CFRAME = 7, > + PSM_CR = 8, > + PSM_NCR = 9, > + DDRMC5_MB_CR = 10, > + DDRMC5_MB_NCR = 11, > + NOCTYPE_CR = 12, > + NOCTYPE_NCR = 13, > + NOCUSER = 14, > + MMCM = 15, > + HNICX_CR = 16, > + HNICX_NCR = 17, > + DDRMC5_CR = 18, > + DDRMC5_NCR = 19, > + GT_CR = 20, > + GT_NCR = 21, > + PLSYSMON_CR = 22, > + PLSYSMON_NCR = 23, > + PL0 = 24, > + PL1 = 25, > + PL2 = 26, > + PL3 = 27, > + NPI_ROOT = 28, > + SSIT_ERR3 = 29, > + SSIT_ERR4 = 30, > + SSIT_ERR5 = 31, > + PMC_APB = 32, > + PMC_ROM = 33, > + MB_FATAL0 = 34, > + MB_FATAL1 = 35, > + PMC_CR = 36, > + PMC_NCR = 37, > + PMC_SMON0 = 39, > + PMC_SMON1 = 40, > + PMC_SMON2 = 41, > + PMC_SMON3 = 42, > + PMC_SMON4 = 43, > + PMC_SMON8 = 47, > + PMC_SMON9 = 48, > + CFI = 49, > + CFRAME_SEU_CRC = 50, > + CFRAME_SEU_ECC = 51, > + PMX_WWDT = 52, > + RTC_ALARM = 54, > + NPLL = 55, > + PPLL = 56, > + CLK_MON = 57, > + INT_PMX_CORR_ERR = 59, > + INT_PMX_UNCORR_ERR = 60, > + SSIT_ERR0 = 61, > + SSIT_ERR1 = 62, > + SSIT_ERR2 = 63, > + IOU_CR = 64, > + IOU_NCR = 65, > + DFX_UXPT_ACT = 66, > + DICE_CDI_PAR = 67, > + DEVIK_PRIV = 68, > + NXTSW_CDI_PAR = 69, > + DEVAK_PRIV = 70, > + DME_PUB_X = 71, > + DME_PUB_Y = 72, > + DEVAK_PUB_X = 73, > + DEVAK_PUB_Y = 74, > + DEVIK_PUB_X = 75, > + DEVIK_PUB_Y = 76, > + PCR_PAR = 77, > + PS_SW_CR = 96, > + PS_SW_NCR = 97, > + PSM_B_CR = 98, > + PSM_B_NCR = 99, > + MB_FATAL = 100, > + PSMX_CHK = 103, > + APLL1_LOCK = 104, > + APLL2_LOCK = 105, > + RPLL_LOCK = 106, > + FLXPLL_LOCK = 107, > + INT_PSM_CR = 108, > + INT_PSM_NCR = 109, > + USB_ERR = 110, > + LPX_DFX = 111, > + INT_LPD_CR = 113, > + INT_LPD_NCR = 114, > + INT_OCM_CR = 115, > + INT_OCM_NCR = 116, > + INT_FPD_CR = 117, > + INT_FPD_NCR = 118, > + INT_IOU_CR = 119, > + INT_IOU_NCR = 120, > + RPU_CLUSTERA_LS = 121, > + RPU_CLUSTERB_LS = 122, > + GIC_AXI = 123, > + GIC_ECC = 124, > + CPM_CR = 125, > + CPM_NCR = 126, > + FPD_CPI = 127, > + FPD_SWDT0 = 128, > + FPD_SWDT1 = 129, > + FPD_SWDT2 = 130, > + FPD_SWDT3 = 131, > + FPX_SPLITTER0_MEM_ERR = 132, > + FPX_SPLITTER0_AXI_ERR = 133, > + FPX_SPLITTER1_MEM_ERR = 134, > + FPX_SPLITTER1_AXI_ERR = 135, > + FPX_SPLITTER2_MEM_ERR = 136, > + FPX_SPLITTER2_AXI_ERR = 137, > + FPX_SPLITTER3_MEM_ERR = 138, > + FPX_SPLITTER3_AXI_ERR = 139, > + APU0 = 140, > + APU1 = 141, > + APU2 = 142, > + APU3 = 143, > + LPX_WWDT0 = 144, > + LPX_WWDT1 = 145, > + ADMA_LS_ERR = 146, > + IPI_ERR = 147, > + OCM0_CORR_ERR = 148, > + OCM1_CORR_ERR = 149, > + OCM0_UNCORR_ERR = 150, > + OCM1_UNCORR_ERR = 151, > + LPX_AFIFS_CORR_ERR = 152, > + LPX_AFIFS_UNCORR_ERR = 153, > + LPX_GLITCH_DET0 = 154, > + LPX_GLITCH_DET1 = 155, > + NOC_NMU_FIREWALL_WR_ERR = 156, > + NOC_NMU_FIREWALL_RD_ERR = 157, > + NOC_NSU_FIREWALL_ERR = 158, > + RPUA0_CORR_ERR = 163, > + RPUA0_MISC1 = 166, > + RPUA0_MISC2 = 167, > + RPUA1_CORR_ERR = 168, > + RPUA1_FATAL_ERR = 169, > + RPUA1_TIMEOUT_ERR = 170, > + RPUA1_MISC1 = 171, > + RPUA1_MISC2 = 172, > + RPUB0_CORR_ERR = 173, > + RPUB0_FATAL_ERR = 174, > + RPUB0_TIMEOUT_ERR = 175, > + RPUB0_MISC1 = 176, > + RPUB0_MISC2 = 177, > + RPUB1_CORR_ERR = 178, > + RPUB1_FATAL_ERR = 179, > + RPUB1_TIMEOUT_ERR = 180, > + RPUB1_MISC1 = 181, > + RPUB1_MISC2 = 182, > + RPU_PCIL_ERR = 184, > + FPX_AFIFS_CORR_ERR = 185, > + FPX_AFIFS_UNCORR_ERR = 186, > + FPD_CMN_1_ERR = 187, > + FPD_CMN_2_ERR = 188, > + FPD_CMN_3_ERR = 189, > + FPD_CML_ERR = 190, > + FPD_INTWRAP_ERR = 191, > + FPD_REST_MON_ERR = 192, > + LPD_MON_ERR = 193, > + AFIFM_FATAL_ERR = 194, > + LPX_AFIFM_NONFATAL_ERR = 195, > + FPX_AFIFM0_NONFATAL_ERR = 196, > + FPX_AFIFM1_NONFATAL_ERR = 197, > + FPX_AFIFM2_NONFATAL_ERR = 198, > + FPX_AFIFM3_NONFATAL_ERR = 199, > + RPU_CLUSTERA_ERR = 200, > + RPU_CLUSTERB_ERR = 201, > + HB_MON_0 = 224, > + HB_MON_1 = 225, > + HB_MON_2 = 226, > + HB_MON_3 = 227, > + PLM_EXCEPTION = 228, > + PCR_LOG_UPDATE = 230, > + ERROR_CRAM_CE = 231, > + ERROR_CRAM_UE = 232, > + ERROR_NPI_UE = 233, > +}; > + > +enum adec_info { What is "ADEC"? Some permutaion of "EDAC"? > + CONF = 0, > + ADEC0, > + ADEC1, > + ADEC2, > + ADEC3, > + ADEC4, > + ADEC5, > + ADEC6, > + ADEC7, > + ADEC8, > + ADEC9, > + ADEC10, > + ADEC11, > + ADEC12, > + ADEC13, > + ADEC14, > + ADEC15, > + ADEC16, > + ADECILC, > +}; > + > +enum reg_info { > + ISR = 0, > + IMR, > + ECCR0_ERR_STATUS, > + ECCR0_ADDR_LO, > + ECCR0_ADDR_HI, > + ECCR0_DATA_LO, > + ECCR0_DATA_HI, > + ECCR0_PAR, > + ECCR1_ERR_STATUS, > + ECCR1_ADDR_LO, > + ECCR1_ADDR_HI, > + ECCR1_DATA_LO, > + ECCR1_DATA_HI, > + ECCR1_PAR, > + XMPU_ERR, > + XMPU_ERR_ADDR_L0, > + XMPU_ERR_ADDR_HI, > + XMPU_ERR_AXI_ID, > + ADEC_CHK_ERR_LOG, > +}; > + > +static void get_ddr_ce_error_info(u32 *error_data, struct edac_priv *priv) > +{ > + u32 regval, par, reghi; "parity" is still short enough but a lot more understandable. > + struct ecc_status *p; > + > + p = &priv->stat; > + > + regval = error_data[ECCR0_ADDR_HI]; > + reghi = regval & ECCR_UE_CE_ADDR_HI_ROW_MASK; So "regval" should be "reglo"? > + regval = error_data[ECCR0_ADDR_LO]; > + p->ceinfo[0].i = regval | (u64)reghi << 32; > + > + par = error_data[ECCR0_PAR]; > + edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n", > + reghi, regval, par); > + > + regval = error_data[ECCR1_ADDR_LO]; > + reghi = error_data[ECCR1_ADDR_HI]; > + p->ceinfo[1].i = regval | (u64)reghi << 32; > + > + par = error_data[ECCR1_PAR]; > + edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n", > + reghi, regval, par); > +} > + > +static void get_ddr_ue_error_info(u32 error_data[REGS_PER_CONTROLLER], struct edac_priv *priv) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ What is that for? > +{ > + u32 regval, par, reghi; > + struct ecc_status *p; > + > + p = &priv->stat; > + > + regval = error_data[ECCR0_ADDR_LO]; > + reghi = error_data[ECCR0_ADDR_HI]; > + par = error_data[ECCR0_PAR]; > + > + p->ueinfo[0].i = regval | (u64)reghi << 32; > + > + edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n", > + reghi, regval, par); > + > + regval = error_data[ECCR1_ADDR_LO]; > + reghi = error_data[ECCR1_ADDR_HI]; > + p->ueinfo[1].i = regval | (u64)reghi << 32; > + par = error_data[ECCR1_PAR]; > + > + edac_dbg(2, "ERR DATA: 0x%08X%08X ERR DATA PARITY: 0x%08X\n", > + reghi, regval, par); > +} Same comments as for get_ddr_ce_error_info(). Looking at those functions, you can actually merge them into one. > +static bool get_ddr_ue_info(u32 error_data[REGS_PER_CONTROLLER], struct edac_priv *priv) > +{ > + u32 eccr0_val, eccr1_val, isr; > + struct ecc_status *p; > + > + p = &priv->stat; > + > + isr = error_data[ISR]; > + if (!(isr & DDRMC5_IRQ_UE_MASK)) > + return false; > + > + eccr0_val = error_data[ECCR0_ERR_STATUS]; > + eccr1_val = error_data[ECCR1_ERR_STATUS]; > + > + if (!eccr0_val && !eccr1_val) > + return false; > + > + if (!eccr0_val) > + p->channel = 1; > + else > + p->channel = 0; > + > + get_ddr_ue_error_info(error_data, priv); > + > + return true; > +} > + > +static bool get_ddr_ce_info(u32 error_data[REGS_PER_CONTROLLER], struct edac_priv *priv) > +{ > + u32 eccr0_val, eccr1_val, isr; > + struct ecc_status *p; > + > + p = &priv->stat; > + > + isr = error_data[ISR]; > + if (!(isr & DDRMC5_IRQ_CE_MASK)) > + return false; > + > + eccr0_val = error_data[ECCR0_ERR_STATUS]; > + eccr1_val = error_data[ECCR1_ERR_STATUS]; > + > + if (!eccr0_val && !eccr1_val) > + return false; > + > + if (!eccr0_val) > + p->channel = 1; > + else > + p->channel = 0; > + > + get_ddr_ce_error_info(error_data, priv); > + > + return true; > +} Also unify into a single function. So basically the above 4 could be a single function. > + > +/** > + * handle_error - Handle Correctable and Uncorrectable errors. > + * @priv: DDR memory controller private instance data. > + * @stat: ECC status structure. > + * @controller: Controller number of the DDRMC5 > + * > + * Handles ECC correctable and uncorrectable errors. > + */ > +static void handle_error(struct edac_priv *priv, struct ecc_status *stat, int controller) > +{ > + struct mem_ctl_info *mci = priv->mci; > + union ecc_error_info pinf; > + unsigned long pa; > + phys_addr_t pfn; > + int err; > + > + if (stat->error_type == DDRMC5_ERR_TYPE_CE) { > + priv->ce_cnt++; > + pinf = stat->ceinfo[stat->channel]; > + snprintf(priv->message, DDRMC5_EDAC_MSG_SIZE, > + "Error type:%s Addr at %lx\n", > + "CE", convert_to_physical(priv, pinf, controller)); > + > + edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci, > + 1, 0, 0, 0, 0, 0, -1, > + priv->message, ""); > + } > + > + if (stat->error_type == DDRMC5_ERR_TYPE_UE) { > + priv->ue_cnt++; > + pinf = stat->ueinfo[stat->channel]; > + snprintf(priv->message, DDRMC5_EDAC_MSG_SIZE, > + "Error type:%s Addr at %lx\n", > + "UE", convert_to_physical(priv, pinf, controller)); > + > + edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci, > + 1, 0, 0, 0, 0, 0, -1, > + priv->message, ""); > + pa = convert_to_physical(priv, pinf, controller); > + pfn = PHYS_PFN(pa); > + > + if (IS_ENABLED(CONFIG_MEMORY_FAILURE)) { > + err = memory_failure(pfn, MF_ACTION_REQUIRED); > + if (err) > + edac_dbg(2, "In fail of memory_failure %d\n", err); > + else > + edac_dbg(2, "Page at PA 0x%lx is hardware poisoned\n", pa); > + } > + } > + > + memset(stat, 0, sizeof(*stat)); This is the wrong ordering - you either clear it on entry or the caller should clear it - the caller should not rely on this function clearing it for the next function it is going to call. > + > +/** > + * mc_init - Initialize one driver instance. > + * @mci: EDAC memory controller instance. > + * @pdev: platform device. > + * > + * Perform initialization of the EDAC memory controller instance and > + * related driver-private data associated with the memory controller the > + * instance is bound to. The stuff which needs commenting doesn't have any but the obvious ones have comments which are more than necessary. :-\ > + */ > +static void mc_init(struct mem_ctl_info *mci, struct platform_device *pdev) > +{ > + mci->pdev = &pdev->dev; > + platform_set_drvdata(pdev, mci); > + > + /* Initialize controller capabilities and configuration */ > + mci->mtype_cap = MEM_FLAG_DDR5; > + mci->edac_ctl_cap = EDAC_FLAG_NONE | EDAC_FLAG_SECDED; > + mci->scrub_cap = SCRUB_HW_SRC; > + mci->scrub_mode = SCRUB_NONE; > + > + mci->edac_cap = EDAC_FLAG_SECDED; > + mci->ctl_name = "amd_ddr_controller"; > + mci->dev_name = dev_name(&pdev->dev); > + mci->mod_name = "amd_edac"; Do: git grep mod_name drivers/edac/ to get an idea how those names are chosen. > + edac_op_state = EDAC_OPSTATE_INT; > + > + init_csrows(mci); > +} > + > +#define to_mci(k) container_of(k, struct mem_ctl_info, dev) > + > +static int amd_rpmsg_send(struct cdx_mcdi *cdx_mcdi, > + const struct cdx_dword *hdr, size_t hdr_len, > + const struct cdx_dword *sdu, size_t sdu_len) Used only once - fold it into the call site. And static functions don't need silly name prefixes like "amd_". > +{ > + unsigned char *send_buf; > + int ret; > + > + send_buf = kzalloc(hdr_len + sdu_len, GFP_KERNEL); > + if (!send_buf) > + return -ENOMEM; > + > + memcpy(send_buf, hdr, hdr_len); > + memcpy(send_buf + hdr_len, sdu, sdu_len); > + > + ret = rpmsg_send(cdx_mcdi->ept, send_buf, hdr_len + sdu_len); > + kfree(send_buf); > + return ret; > +} > + > +static int get_ddr_config(u32 index, u32 *buffer, struct cdx_mcdi *amd_mcdi) > +{ > + size_t outlen; > + int ret; > + > + MCDI_DECLARE_BUF(inbuf, MC_CMD_EDAC_GET_DDR_CONFIG_IN_LEN); > + MCDI_DECLARE_BUF(outbuf, BUFFER_SZ); > + > + MCDI_SET_DWORD(inbuf, EDAC_GET_DDR_CONFIG_IN_CONTROLLER_INDEX, index); > + > + ret = cdx_mcdi_rpc(amd_mcdi, MC_CMD_EDAC_GET_DDR_CONFIG, inbuf, sizeof(inbuf), > + outbuf, sizeof(outbuf), &outlen); > + if (ret) > + return ret; > + memcpy(buffer, MCDI_PTR(outbuf, EDAC_GET_DDR_CONFIG_OUT_REGISTER_VALUES), (ADEC_NUM * 4)); > + return 0; Function returning a value which no one uses. > +} > + > +static int amd_setup_mcdi(struct edac_priv *edac_priv) > +{ > + struct cdx_mcdi *amd_mcdi; > + int ret, i; > + > + amd_mcdi = kzalloc(sizeof(*amd_mcdi), GFP_KERNEL); > + if (!amd_mcdi) > + return -ENOMEM; > + > + /* Store the MCDI ops */ Useless comment. > + amd_mcdi->mcdi_ops = &mcdi_ops; > + /* MCDI FW: Initialize the FW path */ Ditto. > + ret = cdx_mcdi_init(amd_mcdi); > + if (ret) > + return ret; And here you leaked amd_mcdi when you returned. > + amd_mcdi->ept = edac_priv->ept; > + edac_priv->mcdi = amd_mcdi; > + > + for (i = 0; i < NUM_CONTROLLERS; i++) > + get_ddr_config(i, &edac_priv->adec[ADEC_NUM * i], amd_mcdi); > + > + complete(&edac_priv->xfer_done); That looks like a hack. > + return 0; Ditto. > +} > + > +static inline void process_bit(struct edac_priv *priv, unsigned int start, u32 regval) You don't need "inline" - the compiler can decide that itself. And "process_bit" needs a better name. > +{ > + union edac_info rows; > + > + rows.i = regval; > + priv->row_bit[start] = rows.row0; > + priv->row_bit[start + 1] = rows.row1; > + priv->row_bit[start + 2] = rows.row2; > + priv->row_bit[start + 3] = rows.row3; > + priv->row_bit[start + 4] = rows.row4; > +} > + > +static void setup_row_address_map(struct edac_priv *priv, u32 *error_data) > +{ > + union edac_info rows; > + u32 regval; > + > + regval = error_data[ADEC6]; > + process_bit(priv, 0, regval); > + > + regval = error_data[ADEC7]; > + process_bit(priv, 5, regval); > + > + regval = error_data[ADEC8]; > + process_bit(priv, 10, regval); > + > + regval = error_data[ADEC9]; > + rows.i = regval; > + > + priv->row_bit[15] = rows.row0; > + priv->row_bit[16] = rows.row1; > + priv->row_bit[17] = rows.row2; > +} > + > +static void setup_column_address_map(struct edac_priv *priv, u32 *error_data) > +{ > + union edac_info cols; > + u32 regval; > + > + regval = error_data[ADEC9]; > + priv->col_bit[0] = FIELD_GET(MASK_24, regval); > + > + regval = error_data[ADEC10]; > + cols.i = regval; > + priv->col_bit[1] = cols.col1; > + priv->col_bit[2] = cols.col2; > + priv->col_bit[3] = cols.col3; > + priv->col_bit[4] = cols.col4; > + priv->col_bit[5] = cols.col5; > + > + regval = error_data[ADEC11]; > + cols.i = regval; > + priv->col_bit[6] = cols.col1; > + priv->col_bit[7] = cols.col2; > + priv->col_bit[8] = cols.col3; > + priv->col_bit[9] = cols.col4; > + priv->col_bit[10] = cols.col5; > +} Why are those functions copying stuff around? Why aren't you using cols directly? > +static inline bool is_response(u8 *data) > +{ > + return (data[0] == MCDI_RESPONSE); > +} Zap that silly function. > + > +static struct rpmsg_device_id amd_rpmsg_id_table[] = { > + { .name = "error_ipc" }, > + { }, > +}; > +MODULE_DEVICE_TABLE(rpmsg, amd_rpmsg_id_table); > + > +static void amd_rpmsg_post_probe_work(struct work_struct *work) > +{ > + struct edac_priv *priv; > + > + priv = container_of(work, struct edac_priv, work); > + amd_setup_mcdi(priv); > +} Why is probing a work item? Explaining *that* is what a commit message is for - not for repeating useless info. > +static int amd_rpmsg_probe(struct rpmsg_device *rpdev) > +{ > + struct rpmsg_channel_info chinfo = {0}; > + struct edac_priv *pg; > + > + pg = (struct edac_priv *)amd_rpmsg_id_table[0].driver_data; > + chinfo.src = RPMSG_ADDR_ANY; > + chinfo.dst = rpdev->dst; /* NMC */ verify_comment_style: WARNING: No tail comments please: drivers/edac/versalnet_rpmsg_edac.c:1139 [+ chinfo.dst = rpdev->dst; /* NMC */] Check your whole driver. > + strscpy(chinfo.name, amd_rpmsg_id_table[0].name, > + strlen(amd_rpmsg_id_table[0].name)); > + > + pg->ept = rpmsg_create_ept(rpdev, amd_rpmsg_cb, NULL, chinfo); > + if (!pg->ept) > + return dev_err_probe(&rpdev->dev, -ENXIO, > + "Failed to create ept for channel %s\n", > + chinfo.name); > + > + dev_set_drvdata(&rpdev->dev, pg); > + > + schedule_work(&pg->work); > + > + return 0; > +} > + > +static int mc_probe(struct platform_device *pdev) > +{ > + unsigned long time_left, wait_jiffies; > + u32 num_chans, rank, dwidth, config; > + struct device_node *r5_core_node; > + struct edac_mc_layer layers[2]; > + struct mem_ctl_info *mci; > + struct edac_priv *priv; > + struct rproc *rp; > + enum dev_type dt; > + int rc, i; > + > + r5_core_node = of_parse_phandle(pdev->dev.of_node, "amd,rproc", 0); > + if (!r5_core_node) { > + dev_err(&pdev->dev, "amd,rproc: invalid phandle\n"); > + return -EINVAL; > + } > + > + rp = rproc_get_by_phandle(r5_core_node->phandle); > + if (!rp) > + return -EPROBE_DEFER; > + > + rc = rproc_boot(rp); > + if (rc) { > + dev_err(&pdev->dev, "Failed to attach to remote processor\n"); > + rproc_put(rp); > + return rc; > + } > + > + priv = kzalloc(sizeof(*priv), GFP_KERNEL); > + amd_rpmsg_id_table[0].driver_data = (kernel_ulong_t)priv; > + INIT_WORK(&priv->work, amd_rpmsg_post_probe_work); > + init_completion(&priv->xfer_done); > + rc = register_rpmsg_driver(&amd_rpmsg_driver); > + if (rc) { > + edac_printk(KERN_ERR, EDAC_MC, > + "Failed to register RPMsg driver: %d\n", rc); > + goto free_rproc; > + } > + wait_jiffies = msecs_to_jiffies(10000); > + time_left = wait_for_completion_timeout(&priv->xfer_done, wait_jiffies); > + if (time_left == 0) > + goto free_rpmsg; Yah, this needs explanation as to why it is there. > + for (i = 0; i < NUM_CONTROLLERS; i++) { > + config = priv->adec[CONF + i]; > + num_chans = FIELD_GET(DDRMC5_NUM_CHANS_MASK, config); > + rank = FIELD_GET(DDRMC5_RANK_MASK, config); > + rank = 1 << rank; > + dwidth = FIELD_GET(DDRMC5_BUS_WIDTH_MASK, config); > + dt = get_dwidth(dwidth); > + if (dt != DEV_UNKNOWN) > + break; > + } What is that loop supposed to do? Find the last controller before the one with DEV_UNKNOWN device width and register that one? > + layers[0].type = EDAC_MC_LAYER_CHIP_SELECT; > + layers[0].size = rank; > + layers[0].is_virt_csrow = true; > + layers[1].type = EDAC_MC_LAYER_CHANNEL; > + layers[1].size = num_chans; > + layers[1].is_virt_csrow = false; > + > + mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, > + sizeof(struct edac_priv)); > + if (!mci) { > + edac_printk(KERN_ERR, EDAC_MC, > + "Failed memory allocation for mc instance\n"); > + rc = -ENOMEM; > + goto free_rproc; > + } > + priv->mci = mci; > + > + priv->dwidth = dt; > + mc_init(mci, pdev); > + rc = edac_mc_add_mc(mci); > + if (rc) { > + edac_printk(KERN_ERR, EDAC_MC, > + "Failed to register with EDAC core\n"); > + goto free_edac_mc; > + } Yeah, in any case, this needs a lot more explanation how all the parts are supposed to work together. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette