[AMD Official Use Only]

> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Wednesday, September 22, 2021 7:41 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> mingo@xxxxxxxxxx; mchehab@xxxxxxxxxx; Ghannam, Yazen
> <Yazen.Ghannam@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran
> RAS
>
> [CAUTION: External Email]
>
> On Sun, Sep 12, 2021 at 10:13:11PM -0400, Mukul Joshi wrote:
> > On Aldebaran, GPU driver will handle bad page retirement even though
> > UMC is host managed. As a result, register a bad page retirement
> > handler on the mce notifier chain to retire bad pages on Aldebaran.
> >
> > v1->v2:
> > - Use smca_get_bank_type() to determine MCA bank.
> > - Envelope the changes under #ifdef CONFIG_X86_MCE_AMD.
> > - Use MCE_PRIORITY_UC instead of MCE_PRIO_ACCEL as we are
> >   only handling uncorrectable errors.
> > - Use macros to determine UMC instance and channel instance
> >   where the uncorrectable error occured.
> > - Update the headline.
>
> Same note as for the previous patch.
>

Acked.

> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +                                    unsigned long val, void *data)
> > +{
> > +        struct mce *m = (struct mce *)data;
> > +        struct amdgpu_device *adev = NULL;
> > +        uint32_t gpu_id = 0;
> > +        uint32_t umc_inst = 0;
> > +        uint32_t ch_inst, channel_index = 0;
> > +        struct ras_err_data err_data = {0, 0, 0, NULL};
> > +        struct eeprom_table_record err_rec;
> > +        uint64_t retired_page;
> > +
> > +        /*
> > +         * If the error was generated in UMC_V2, which belongs to GPU UMCs,
> > +         * and error occurred in DramECC (Extended error code = 0) then only
> > +         * process the error, else bail out.
> > +         */
> > +        if (!m || !((smca_get_bank_type(m->bank) == SMCA_UMC_V2) &&
> > +                    (XEC(m->status, 0x1f) == 0x0)))
> > +                return NOTIFY_DONE;
> > +
> > +        /*
> > +         * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > +         */
> > +        gpu_id = GET_MCA_IPID_GPUID(m->ipid) - GPU_ID_OFFSET;
> > +
> > +        adev = find_adev(gpu_id);
> > +        if (!adev) {
> > +                dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> > +                         __func__, gpu_id);
> > +                return NOTIFY_DONE;
> > +        }
> > +
> > +        /*
> > +         * If it is correctable error, return.
> > +         */
> > +        if (mce_is_correctable(m)) {
> > +                return NOTIFY_OK;
> > +        }
>
> This can run before you find_adev().
>

Acked.

> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > +        /*
> > +         * Register the x86 notifier only once
> > +         * with MCE subsystem.
> > +         */
> > +        if (notifier_registered == false) {
> > +                mce_register_decode_chain(&amdgpu_bad_page_nb);
> > +                notifier_registered = true;
> > +        }
>
> I have a patchset which will get rid of the need to do this silliness - only if I had
> some time to actually prepare it for submission... :-\
>

:-\ Thank you.

Regards,
Mukul

> --
> Regards/Gruss,
>     Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&data=04%7C01%7Cmukul.joshi%40amd.com%7C7ae9c87153f7
> 4572712908d97dbdcc7a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637679076423976867%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata
> =M5UDrSea0lnEi5%2BhU4ck0zk1dZD9kX4DUoXt95J6dJ4%3D&reserved=0
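
For reference, the reordering acked above ("This can run before you find_adev()") could look roughly like the following. This is only a userspace sketch, not the actual v3 patch: struct mock_mce, struct mock_adev, mock_find_adev(), mock_bad_page_notifier() and lookup_count are hypothetical stand-ins invented for illustration, and the NOTIFY_* values are redefined locally rather than taken from the kernel headers. The correctable-error test moves ahead of the device lookup. The not-found path also avoids touching adev, whereas the quoted hunk passes adev->dev to dev_warn() while adev is NULL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for illustration; NOT the kernel definitions. */
enum { NOTIFY_DONE = 0, NOTIFY_OK = 1 };

/* Hypothetical, pre-decoded view of the fields the handler looks at. */
struct mock_mce {
	bool gpu_umc_dram_ecc; /* stands in for the SMCA_UMC_V2 && XEC == 0 test */
	bool correctable;      /* stands in for mce_is_correctable(m) */
	uint32_t gpu_id;       /* stands in for the offset-corrected GPU id */
};

struct mock_adev {
	uint32_t gpu_id;
};

static struct mock_adev mock_devices[] = { { 0 }, { 1 } };

/* Counts lookups so the effect of the reordering is observable. */
static unsigned int lookup_count;

/* Hypothetical replacement for find_adev(): linear search by GPU id. */
static struct mock_adev *mock_find_adev(uint32_t gpu_id)
{
	size_t i;

	lookup_count++;
	for (i = 0; i < sizeof(mock_devices) / sizeof(mock_devices[0]); i++) {
		if (mock_devices[i].gpu_id == gpu_id)
			return &mock_devices[i];
	}
	return NULL;
}

static int mock_bad_page_notifier(const struct mock_mce *m)
{
	struct mock_adev *adev;

	/* Not a GPU UMC DramECC error: nothing for this handler to do. */
	if (!m || !m->gpu_umc_dram_ecc)
		return NOTIFY_DONE;

	/* Correctable errors need no device, so test before the lookup. */
	if (m->correctable)
		return NOTIFY_OK;

	adev = mock_find_adev(m->gpu_id);
	if (!adev)
		return NOTIFY_DONE; /* adev is NULL here; do not dereference it */

	/* Bad page retirement for adev would happen here. */
	return NOTIFY_OK;
}
```

With this ordering a correctable error returns NOTIFY_OK without any device lookup, and an unknown GPU id returns NOTIFY_DONE without dereferencing a NULL pointer.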