[AMD Official Use Only]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Joshi, Mukul
> Sent: Thursday, July 29, 2021 8:00 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: x86-ml <x86@xxxxxxxxxx>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>; lkml <linux-kernel@xxxxxxxxxxxxxxx>;
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Borislav Petkov <bp@xxxxxxxxx>; Alex Deucher
> <alexdeucher@xxxxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> > -----Original Message-----
> > From: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> > Sent: Thursday, June 3, 2021 5:13 PM
> > To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> > Cc: Borislav Petkov <bp@xxxxxxxxx>; Alex Deucher
> > <alexdeucher@xxxxxxxxx>; x86-ml <x86@xxxxxxxxxx>; Kasiviswanathan,
> > Harish <Harish.Kasiviswanathan@xxxxxxx>; lkml
> > <linux-kernel@xxxxxxxxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for
> > Aldebaran
> >
> > On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> > ...
> > > > Is that the same deferred interrupt which calls
> > > > amd_deferred_error_interrupt() ?
> > >
> > > Sorry, picking this up after some time. I thought I had replied to
> > > this email.
> > > Yes, it is the same deferred interrupt which calls
> > > amd_deferred_error_interrupt().
> > >
> >
> > Mukul,
> >
> > Do you expect that the driver will need to mark pages with high
> > correctable error counts as bad? I think the hardware folks may want
> > the GPU memory errors to be handled more aggressively than CPU memory
> > errors. The specific threshold may change from product to product, so
> > it may make sense to hardcode this in the driver.
> >
>
> Sorry I missed this email completely. Just saw it so responding now.
>
> At the moment, we don't have a requirement to mark a page "bad" if there is a
> high correctable error count.
> Our previous GPU ASICs that support RAS also do not have such a feature.
> But you make a good point. It might be worthwhile to go and ask the hardware
> folks about it.
>
> > We have similar functionality in the Correctable Errors Collector. But
> > enterprise users may prefer a direct approach done in the driver
> > (based on the hardware experts' guidance) instead of configuring the
> > kernel at runtime.
> >
> > So I think having a separate priority may make sense if some special
> > functionality, or combination of behaviors, is needed which doesn't fall
> > under any of the existing priorities. In this case, "special
> > functionality" could be that the GPU memory needs to be handled
> > differently than CPU memory.
> >
> > Another thing is that this behavior is similar to the NFIT behavior,
> > i.e. there's a memory error on an external device that needs to be
> > handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT
> > to be generic (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers
> > with the same priority are okay, right?
> >
> With respect to MCE priority, I was thinking of using MCE_PRIO_EDAC
> instead of creating a new priority, as the code in the GPU driver is doing
> error detection and handling the uncorrectable errors.
> Not sure if that aligns with the definition of an EDAC device in the kernel.
>
> What do you think?
>
> Regards,
> Mukul
>

After talking to Yazen, MCE_PRIO_UC might be a better choice for the MCE
priority, as we are dealing only with uncorrectable errors.
I will be sending out a v2 patch with changes to use MCE_PRIO_UC and drop
MCE_PRIO_ACCEL, and see what the feedback is.

Thanks,
Mukul

> > Thanks,
> > Yazen