[AMD Official Use Only - Internal Distribution Only]

> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Thursday, May 13, 2021 5:53 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>; x86-ml <x86@xxxxxxxxxx>;
> lkml <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is
> > defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile
> > the amdgpu driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
>
> See how smca_get_long_name() is exported and export that function the same
> way.
>

That's probably not the best example to look at.
smca_get_long_name() is used in drivers/edac/mce_amd.c, and that file doesn't
get compiled when CONFIG_X86_MCE_AMD is not defined.
The amdgpu driver, on the other hand, has no dependency on CONFIG_X86_MCE_AMD.

So here is one option that we can try:
1. Export smca_get_bank_type().
2. Wrap all of my code in the GPU driver in #ifdef CONFIG_X86_MCE_AMD.

Will that work for you?

Thanks,
Mukul

> To save you some energy: is_smca_umc_v2() is not going to happen.
>
> > You can think of GPU device as a EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
>
> An EDAC "device", as you call it, is not interested in handling UEs. If
> anything, it counts them.
>
> > It is a deferred interrupt that generates an MCE.
>
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
>
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
>
> What page is that? Normal DRAM page or a page in some special GPU memory?
>
> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
>
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
>
> Looking at
>
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
>
> that's some VRAM thing so I'm guessing special memory on the GPU.
>
> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the same
> retiring happens?
>
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
>
> Ok, I'm a user and I can access that information through sysfs. What can I do
> with it?
>
> > Hope it clears what "bad page retirement" is achieving.
>
> It is getting there.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx