> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Wednesday, May 12, 2021 5:37 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>; x86-ml <x86@xxxxxxxxxx>; lkml
> <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> Hi,
>
> so this is a drive-by review using the lore.kernel.org mail because I
> wasn't CCed on this.
>
> On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +                                    unsigned long val, void *data) {
> > +        struct mce *m = (struct mce *)data;
> > +        struct amdgpu_device *adev = NULL;
> > +        uint32_t gpu_id = 0;
> > +        uint32_t umc_inst = 0;
> > +        uint32_t chan_index = 0;
> > +        struct ras_err_data err_data = {0, 0, 0, NULL};
> > +        struct eeprom_table_record err_rec;
> > +        uint64_t retired_page;
> > +
> > +        /*
> > +         * If the error was generated in UMC_V2, which belongs to GPU UMCs,
>
> Why does it matter if the error is a v2 UMC generated one?
>
> IOW, can you detect the type of errors in GPU memory - I assume this is
> what this is supposed to handle - from the error signature itself
> instead of doing is_smca_umc_v2?

SMCA UMCv2 corresponds to the GPU's UMC MCA bank, and the GPU driver is
only interested in errors on the GPU UMC. We cannot know this from the
error signature alone, without is_smca_umc_v2().

>
> > +         * and error occurred in DramECC (Extended error code = 0) then only
> > +         * process the error, else bail out.
> > +         */
> > +        if (!m || !(is_smca_umc_v2(m->bank) && (XEC(m->status, 0x1f) == 0x0)))
> > +                return NOTIFY_DONE;
> > +
> > +        gpu_id = GET_MCA_IPID_GPUID(m->ipid);
> > +
> > +        /*
> > +         * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > +         */
> > +        gpu_id -= GPU_ID_OFFSET;
> > +
> > +        adev = find_adev(gpu_id);
> > +        if (!adev) {
> > +                dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> > +                         __func__, gpu_id);
> > +                return NOTIFY_DONE;
> > +        }
> > +
> > +        /*
> > +         * If it is correctable error, then print a message and return.
> > +         */
> > +        if (mce_is_correctable(m)) {
> > +                dev_info(adev->dev, "%s: UMC Correctable error detected.",
> > +                         __func__);
>
> So put yourself in the shoes of a support engineer who starts getting
> all those calls about people's hardware getting correctable errors and
> should they replace it and should AMD RMA the GPUs and so on and so
> on..? Do you really wanna be on the receiving side of that call?
>
> IOW, whom does printing the fact that the GPU had a corrected error
> which got corrected by the hardware, help and do you really want to
> upset people needlessly?

Agreed. I will remove this debug print.

>
> > +                return NOTIFY_OK;
> > +        }
> > +
> > +        /*
> > +         * If it is uncorrectable error, then find out UMC instance and
> > +         * channel index.
> > +         */
> > +        find_umc_inst_chan_index(m, &umc_inst, &chan_index);
>
> That's a void function but it could return a pointer to the instance
> and channel bundled in a struct maybe...

Maybe. I hope it's not too much of a concern if it stays the way it is.
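If it helps, here is roughly how I understand that suggestion - a
minimal sketch only, where struct umc_inst_chan is a hypothetical type
and GET_UMC_INST()/GET_CHAN_INDEX() merely stand in for the existing
IPID decode logic, not real macros:

struct umc_inst_chan {
        uint32_t umc_inst;
        uint32_t chan_index;
};

/*
 * Sketch: decode the UMC instance and channel index from the MCA
 * registers and hand them back by value, instead of filling two
 * out-parameters through a void function.
 */
static struct umc_inst_chan find_umc_inst_chan_index(struct mce *m)
{
        struct umc_inst_chan loc;

        loc.umc_inst = GET_UMC_INST(m->ipid);      /* placeholder decode */
        loc.chan_index = GET_CHAN_INDEX(m->ipid);  /* placeholder decode */

        return loc;
}

The caller then becomes:

        struct umc_inst_chan loc = find_umc_inst_chan_index(m);

Returning the small struct by value rather than a pointer would also
avoid any questions about the lifetime of the returned object.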
>
> > +
> > +        dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d,"
> > +                 " chan_idx: %d", umc_inst, chan_index);
> > +
> > +        memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
> > +
> > +        /*
> > +         * Translate UMC channel address to Physical address
> > +         */
> > +        retired_page = ADDR_OF_8KB_BLOCK(m->addr) |
> > +                        ADDR_OF_256B_BLOCK(chan_index) |
> > +                        OFFSET_IN_256B_BLOCK(m->addr);
> > +
> > +        err_rec.address = m->addr;
> > +        err_rec.retired_page = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> > +        err_rec.ts = (uint64_t)ktime_get_real_seconds();
> > +        err_rec.err_type = AMDGPU_RAS_EEPROM_ERR_NON_RECOVERABLE;
> > +        err_rec.cu = 0;
> > +        err_rec.mem_channel = chan_index;
> > +        err_rec.mcumc_id = umc_inst;
> > +
> > +        err_data.err_addr = &err_rec;
> > +        err_data.err_addr_cnt = 1;
> > +
> > +        if (amdgpu_bad_page_threshold != 0) {
> > +                amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> > +                                         err_data.err_addr_cnt);
> > +                amdgpu_ras_save_bad_pages(adev);
> > +        }
> > +
> > +        return NOTIFY_OK;
> > +}
> > +
> > +static struct notifier_block amdgpu_bad_page_nb = {
> > +        .notifier_call = amdgpu_bad_page_notifier,
> > +        .priority = MCE_PRIO_ACCEL,
>
> Nothing ever explains why this needs to be a separate priority. So how
> about it?

I wasn't really sure whether I should use the EDAC priority here or
create a new one for accelerator devices. I thought using the EDAC
priority might not be accepted by the maintainers, since EDAC and GPU
(accelerator) devices are two different classes of devices. That is the
reason I created a new one. I am OK with using the EDAC priority if
that is acceptable.

>
> > +};
> > +
> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > +        /*
> > +         * Register the x86 notifier with MCE subsystem.
> > +         * Please note a notifier can be registered only once
> > +         * with the MCE subsystem.
> > +         */
>
> Why do you need this? Other users don't need that. Do you need to call
> mce_unregister_decode_chain() when the driver gets removed or so?
>
> > +        if (notifier_registered == false) {
>
> This is just silly and should be fixed properly once the issue is
> understood.

A system can have multiple GPUs, but we only want a single notifier
registered for the whole driver. I will change the comment to
explicitly state this. A sketch of what I have in mind is below, after
your signature.

Thanks,
Mukul

>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
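P.S.: As promised above, a minimal sketch of the "register only once"
logic, assuming the real problem is just making the guard race-free
when several GPUs probe in parallel. The atomic flag and the unregister
helper are my illustrative additions;
mce_register_decode_chain()/mce_unregister_decode_chain() are the
existing x86 MCE decode chain interfaces.

#include <linux/atomic.h>

/* 0 = not registered, 1 = registered; shared across all amdgpu devices. */
static atomic_t mce_notifier_registered = ATOMIC_INIT(0);

static void amdgpu_register_bad_pages_mca_notifier(void)
{
        /*
         * A system can have multiple GPUs, but the MCE decode chain
         * must see this notifier_block only once. atomic_cmpxchg()
         * ensures exactly one of the concurrently probing devices
         * registers it.
         */
        if (atomic_cmpxchg(&mce_notifier_registered, 0, 1) == 0)
                mce_register_decode_chain(&amdgpu_bad_page_nb);
}

/* Counterpart for when the last amdgpu device is removed. */
static void amdgpu_unregister_bad_pages_mca_notifier(void)
{
        if (atomic_cmpxchg(&mce_notifier_registered, 1, 0) == 1)
                mce_unregister_decode_chain(&amdgpu_bad_page_nb);
}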