[AMD Official Use Only]

> -----Original Message-----
> From: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Sent: Thursday, September 23, 2021 10:29 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> bp@xxxxxxxxx; mingo@xxxxxxxxxx; mchehab@xxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS
>
> On Wed, Sep 22, 2021 at 03:36:20PM -0400, Mukul Joshi wrote:
> > On Aldebaran, GPU driver will handle bad page retirement even though
> > UMC is host managed. As a result, register a bad page retirement
> > handler on the mce notifier chain to retire bad pages on Aldebaran.
> >
>
> I think this should state that the driver will do page retirement for GPU-managed
> memory. As written, it implies that the driver do page retirement in general for
> the system.
>

ACK. I will update the description.

> ...
>
> > +
> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +				    unsigned long val, void *data) {
> > +	struct mce *m = (struct mce *)data;
> > +	struct amdgpu_device *adev = NULL;
> > +	uint32_t gpu_id = 0;
> > +	uint32_t umc_inst = 0;
> > +	uint32_t ch_inst, channel_index = 0;
> > +	struct ras_err_data err_data = {0, 0, 0, NULL};
> > +	struct eeprom_table_record err_rec;
> > +	uint64_t retired_page;
> > +
> > +	/*
> > +	 * If the error was generated in UMC_V2, which belongs to GPU UMCs,
> > +	 * and error occurred in DramECC (Extended error code = 0) then only
> > +	 * process the error, else bail out.
> > +	 */
> > +	if (!m || !((smca_get_bank_type(m->bank) == SMCA_UMC_V2) &&
> > +		    (XEC(m->status, 0x1f) == 0x0)))
>
> The MCA_STATUS[ErrorCodeExt] field is bits [21:16], so the mask should be
> 0x3f.

Ack. Thanks for catching this.

>
> > +		return NOTIFY_DONE;
> > +
> > +	/*
> > +	 * If it is correctable error, return.
> > +	 */
> > +	if (mce_is_correctable(m))
> > +		return NOTIFY_OK;
>
> Shouldn't this be "NOTIFY_DONE" if "don't care" about this error?

The thinking is we want to stop calling further consumers since it's a correctable
error in GPU UMC and we are not taking any action about the correctable errors.

Thanks,
Mukul

>
> Thanks,
> Yazen
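
For reference, the mask comment above can be illustrated with a small standalone sketch. It assumes XEC() behaves like the kernel helper in arch/x86/include/asm/mce.h (shift right by 16, then apply the mask); status_with_xec() and looks_like_xec_zero() are hypothetical helpers written only for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed to mirror the kernel's XEC() helper: MCA_STATUS[ErrorCodeExt]
 * starts at bit 16, so shift down and mask off the field.
 */
#define XEC(x, mask) (((x) >> 16) & (mask))

/* Hypothetical helper: build a status value carrying only ErrorCodeExt. */
static uint64_t status_with_xec(uint64_t xec)
{
	/* The field is six bits wide, bits [21:16]. */
	return (xec & 0x3f) << 16;
}

/*
 * Hypothetical helper: returns 1 when the "ErrorCodeExt == 0" test from
 * the patch would pass for this status under the given mask.
 */
static int looks_like_xec_zero(uint64_t status, uint64_t mask)
{
	return XEC(status, mask) == 0;
}
```

With a 0x1f mask, bit 21 is dropped, so an extended error code of 0x20 is indistinguishable from 0 and would be treated as DramECC; with 0x3f the two are told apart.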