> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Wednesday, May 12, 2021 5:37 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>; x86-ml <x86@xxxxxxxxxx>; lkml
> <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> Hi,
>
> so this is a drive-by review using the lore.kernel.org mail because I
> wasn't CCed on this.
>
> On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +                                    unsigned long val, void *data) {
> > +        struct mce *m = (struct mce *)data;
> > +        struct amdgpu_device *adev = NULL;
> > +        uint32_t gpu_id = 0;
> > +        uint32_t umc_inst = 0;
> > +        uint32_t chan_index = 0;
> > +        struct ras_err_data err_data = {0, 0, 0, NULL};
> > +        struct eeprom_table_record err_rec;
> > +        uint64_t retired_page;
> > +
> > +        /*
> > +         * If the error was generated in UMC_V2, which belongs to GPU UMCs,
>
> Why does it matter if the error is a v2 UMC generated one?
>
> IOW, can you detect the type of errors in GPU memory - I assume this is
> what this is supposed to handle - from the error signature itself
> instead of doing is_smca_umc_v2?

SMCA UMCv2 corresponds to the GPU's UMC MCA bank, and the GPU driver is
only interested in errors on the GPU UMC. We cannot know this from the
error signature alone, without is_smca_umc_v2().

>
> > +         * and error occurred in DramECC (Extended error code = 0) then only
> > +         * process the error, else bail out.
> > +         */
> > +        if (!m || !(is_smca_umc_v2(m->bank) && (XEC(m->status, 0x1f) == 0x0)))
> > +                return NOTIFY_DONE;
> > +
> > +        gpu_id = GET_MCA_IPID_GPUID(m->ipid);
> > +
> > +        /*
> > +         * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > +         */
> > +        gpu_id -= GPU_ID_OFFSET;
> > +
> > +        adev = find_adev(gpu_id);
> > +        if (!adev) {
> > +                dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> > +                         __func__, gpu_id);
> > +                return NOTIFY_DONE;
> > +        }
> > +
> > +        /*
> > +         * If it is correctable error, then print a message and return.
> > +         */
> > +        if (mce_is_correctable(m)) {
> > +                dev_info(adev->dev, "%s: UMC Correctable error detected.",
> > +                         __func__);
>
> So put yourself in the shoes of a support engineer who starts getting
> all those calls about people's hardware getting correctable errors and
> should they replace it and should AMD RMA the GPUs and so on and so
> on..? Do you really wanna be on the receiving side of that call?
>
> IOW, whom does printing the fact that the GPU had a corrected error
> which got corrected by the hardware, help and do you really want to
> upset people needlessly?

Agreed. I will remove this debug print.

>
> > +                return NOTIFY_OK;
> > +        }
> > +
> > +        /*
> > +         * If it is uncorrectable error, then find out UMC instance and
> > +         * channel index.
> > +         */
> > +        find_umc_inst_chan_index(m, &umc_inst, &chan_index);
>
> That's a void function but it could return a pointer to the instance
> and channel bundled in a struct maybe...

Maybe. I hope it's not too much of a concern if it stays the way it is.
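If it helps, here is roughly how I understand that suggestion - a
minimal sketch only, where struct umc_inst_chan is a hypothetical type
and GET_UMC_INST()/GET_CHAN_INDEX() merely stand in for the existing
IPID decode logic, not real macros:

struct umc_inst_chan {
        uint32_t umc_inst;
        uint32_t chan_index;
};

/*
 * Sketch: decode the UMC instance and channel index from the MCA
 * registers and hand them back by value, instead of filling two
 * out-parameters through a void function.
 */
static struct umc_inst_chan find_umc_inst_chan_index(struct mce *m)
{
        struct umc_inst_chan loc;

        loc.umc_inst = GET_UMC_INST(m->ipid);      /* placeholder decode */
        loc.chan_index = GET_CHAN_INDEX(m->ipid);  /* placeholder decode */

        return loc;
}

The caller then becomes:

        struct umc_inst_chan loc = find_umc_inst_chan_index(m);

Returning the small struct by value rather than a pointer would also
avoid any questions about the lifetime of the returned object.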
>
> > +
> > +        dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d,"
> > +                 " chan_idx: %d", umc_inst, chan_index);
> > +
> > +        memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
> > +
> > +        /*
> > +         * Translate UMC channel address to Physical address
> > +         */
> > +        retired_page = ADDR_OF_8KB_BLOCK(m->addr) |
> > +                        ADDR_OF_256B_BLOCK(chan_index) |
> > +                        OFFSET_IN_256B_BLOCK(m->addr);
> > +
> > +        err_rec.address = m->addr;
> > +        err_rec.retired_page = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> > +        err_rec.ts = (uint64_t)ktime_get_real_seconds();
> > +        err_rec.err_type = AMDGPU_RAS_EEPROM_ERR_NON_RECOVERABLE;
> > +        err_rec.cu = 0;
> > +        err_rec.mem_channel = chan_index;
> > +        err_rec.mcumc_id = umc_inst;
> > +
> > +        err_data.err_addr = &err_rec;
> > +        err_data.err_addr_cnt = 1;
> > +
> > +        if (amdgpu_bad_page_threshold != 0) {
> > +                amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> > +                                         err_data.err_addr_cnt);
> > +                amdgpu_ras_save_bad_pages(adev);
> > +        }
> > +
> > +        return NOTIFY_OK;
> > +}
> > +
> > +static struct notifier_block amdgpu_bad_page_nb = {
> > +        .notifier_call = amdgpu_bad_page_notifier,
> > +        .priority = MCE_PRIO_ACCEL,
>
> Nothing ever explains why this needs to be a separate priority. So how
> about it?

I wasn't really sure whether I should use the EDAC priority here or
create a new one for accelerator devices. I thought using the EDAC
priority might not be accepted by the maintainers, since EDAC and GPU
(accelerator) devices are two different classes of devices. That is the
reason I created a new one. I am OK with using the EDAC priority if
that is acceptable.

>
> > +};
> > +
> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > +        /*
> > +         * Register the x86 notifier with MCE subsystem.
> > +         * Please note a notifier can be registered only once
> > +         * with the MCE subsystem.
> > +         */
>
> Why do you need this? Other users don't need that. Do you need to call
> mce_unregister_decode_chain() when the driver gets removed or so?
>
> > +        if (notifier_registered == false) {
>
> This is just silly and should be fixed properly once the issue is
> understood.

A system can have multiple GPUs, but we only want a single notifier
registered for the whole driver. I will change the comment to
explicitly state this. A sketch of what I have in mind is below, after
your signature.

Thanks,
Mukul

>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
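P.S.: As promised above, a minimal sketch of the "register only once"
logic, assuming the real problem is just making the guard race-free
when several GPUs probe in parallel. The atomic flag and the unregister
helper are my illustrative additions;
mce_register_decode_chain()/mce_unregister_decode_chain() are the
existing x86 MCE decode chain interfaces.

#include <linux/atomic.h>

/* 0 = not registered, 1 = registered; shared across all amdgpu devices. */
static atomic_t mce_notifier_registered = ATOMIC_INIT(0);

static void amdgpu_register_bad_pages_mca_notifier(void)
{
        /*
         * A system can have multiple GPUs, but the MCE decode chain
         * must see this notifier_block only once. atomic_cmpxchg()
         * ensures exactly one of the concurrently probing devices
         * registers it.
         */
        if (atomic_cmpxchg(&mce_notifier_registered, 0, 1) == 0)
                mce_register_decode_chain(&amdgpu_bad_page_nb);
}

/* Counterpart for when the last amdgpu device is removed. */
static void amdgpu_unregister_bad_pages_mca_notifier(void)
{
        if (atomic_cmpxchg(&mce_notifier_registered, 1, 0) == 1)
                mce_unregister_decode_chain(&amdgpu_bad_page_nb);
}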