[AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: Borislav Petkov <bp@xxxxxxxxx> > Sent: Monday, March 3, 2025 11:25 PM > To: Datta, Shubhrajyoti <shubhrajyoti.datta@xxxxxxx> > Cc: Krzysztof Kozlowski <krzk@xxxxxxxxxx>; Rob Herring <robh@xxxxxxxxxx>; > Conor Dooley <conor+dt@xxxxxxxxxx>; Tony Luck <tony.luck@xxxxxxxxx>; James > Morse <james.morse@xxxxxxx>; Mauro Carvalho Chehab > <mchehab@xxxxxxxxxx>; Robert Richter <rric@xxxxxxxxxx>; linux- > kernel@xxxxxxxxxxxxxxx; devicetree@xxxxxxxxxxxxxxx; linux-edac@xxxxxxxxxxxxxxx; git > (AMD-Xilinx) <git@xxxxxxx> > Subject: Re: [PATCH v5 5/5] EDAC: Versal NET: Add support for error notification > > Caution: This message originated from an External Source. Use proper caution > when opening attachments, clicking links, or responding. > > > On Thu, Feb 27, 2025 at 11:32:10AM +0000, Datta, Shubhrajyoti wrote: > > > > +union edac_info { > > > > > > What is an "edac_info"? > > This is the row and column positions. > > We are using it to extract the position from the address decoder registers. > > Needs a better name which is descriptive as to how it is used or what it represents. Agreed will try to name row_col_mapping or addr_decoder_info. > > > > > +static void get_ddr_ue_error_info(u32 > > > > +error_data[REGS_PER_CONTROLLER], > > > struct edac_priv *priv) > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > What is that for? > > > > > The error_data contains the register values. Linux does not have > > access to the DDRMC register Space. It queries it from the NMC and gets the the > data from the rpmsg callback. > > I wasn't clear. Arrays in C are passed as pointers - the compiler does that anyway. > You don't have to do this weird parameter specification. Will do thanks. > > > > > + mci->edac_cap = EDAC_FLAG_SECDED; > > > > + mci->ctl_name = "amd_ddr_controller"; > > > > + mci->dev_name = dev_name(&pdev->dev); > > > > + mci->mod_name = "amd_edac"; > > > > > > Do: > > > > > > git grep mod_name drivers/edac/ > > > > > > to get an idea how those names are chosen. > > #define EDAC_MOD_STR "r82600_edac" > > mci->mod_name = EDAC_MOD_STR; > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > /tree/drivers/edac/r82600_edac.c?h=v6.14-rc4#n316 > > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > /tree/drivers/edac/i5000_edac.c?h=v6.14-rc4#n1424 > > mci->mod_name = "i5000_edac.c"; > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > /tree/drivers/edac/highbank_mc_edac.c?h=v6.14-rc4#n218 > > mci->mod_name = pdev->dev.driver->name; > > > > let me know if mci->mod_name = pdev->dev.driver->name; is fine. > > I think you didn't get me again. > > amd64_edac.c - the x86 driver is called this: > > #define EDAC_MOD_STR "amd64_edac" > > Calling yours "amd_edac" doesn't work. > > "versalnet_edac"? That's probably better. Will do. > > > > You don't need "inline" - the compiler can decide that itself. And > > > "process_bit" needs a better name. > > > > Will rename it to populate_row_bit > > Or "assign_row_bit" or whatever. > > The function name should be describing what the function does as closely as > possible. > > > > Why are those functions copying stuff around? Why aren't you using > > > cols directly? > > > > The column bit position is used in converting to the physical address. > > We read it at init and use it every time an error occurs to get the address. > > Did you mean to remove the regval. Or read the error_data every time. > > I mean simply use cols instead of assigning stuff to priv->col_bit* and then using > that. Will do > > > > Why is probing a work item? > > > > > > Explaining *that* is what a commit message is for - not for > > > repeating useless info. > > The RPMsg probe is invoked from a thread within the virtio driver > > responsible for processing the response ring. If the probe initiates > > an mcdi API call, it blocks until the mcdi response is received. > > However, since the mcdi response is also processed by the same thread > > that triggered the rpmsg probe, the thread remains blocked, preventing it from > handling the response. This results in a deadlock. > > > > To prevent it we have a work scheduled. > > This is just insane. > > I don't see anything in amd_setup_mcdi() that needs some response from some > mcdi thing. If not, you don't need the work queue thing. The amd_setup_mcdi calls get_ddr_config this sends and rpc (calling cdx_mcdi_rpc) to get the ddr configuration. > > > > > > > > + for (i = 0; i < NUM_CONTROLLERS; i++) { > > > > + config = priv->adec[CONF + i]; > > > > + num_chans = FIELD_GET(DDRMC5_NUM_CHANS_MASK, > config); > > > > + rank = FIELD_GET(DDRMC5_RANK_MASK, config); > > > > + rank = 1 << rank; > > > > + dwidth = FIELD_GET(DDRMC5_BUS_WIDTH_MASK, config); > > > > + dt = get_dwidth(dwidth); > > > > + if (dt != DEV_UNKNOWN) > > > > + break; > > > > + } > > > > > > What is that loop supposed to do? Find the last controller before > > > the one with DEV_UNKNOWN device width and register that one? > > > > There are 8 controllers all we try to get the first one that is enabled and register > that one. > > We use the device unknown to know if that is enabled or not. > > The first one that is enabled has unknown device width? What? The loop iterates through all 8 controllers, checking their configuration registers. It looks for the first enabled controller by verifying if its device width (dt) is not DEV_UNKNOWN. Once it finds the first valid controller, it breaks out of the loop and registers that one > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette