On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
...
> > Is that the same deferred interrupt which calls
> > amd_deferred_error_interrupt() ?
>
> Sorry picking this up after sometime. I thought I had replied to this email.
> Yes it is the same deferred interrupt which calls amd_deferred_error_interrupt().
>

Mukul,
Do you expect that the driver will need to mark pages with high
correctable error counts as bad? I think the hardware folks may want GPU
memory errors to be handled more aggressively than CPU memory errors. The
specific threshold may change from product to product, so it may make
sense to hardcode this in the driver.

We have similar functionality in the Correctable Errors Collector. But
enterprise users may prefer a direct approach done in the driver (based on
the hardware experts' guidance) instead of configuring the kernel at
runtime.

So I think having a separate priority may make sense if some special
functionality, or combination of behaviors, is needed that doesn't fall
under any of the existing priorities. In this case, the "special
functionality" could be that GPU memory needs to be handled differently
than CPU memory.

Another thing is that this behavior is similar to the NFIT behavior, i.e.
there's a memory error on an external device that needs to be handled by
the device's driver. So maybe we can rename MCE_PRIO_NFIT to be generic
(MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with the same
priority is okay, right?

Thanks,
Yazen
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
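
On the question of several notifiers sharing one priority, the following is a minimal userspace sketch, not kernel code. It assumes the chain is kept sorted by descending priority and that a newly registered entry is placed after existing entries of equal priority. The struct, the function names, and the numeric priority values (including PRIO_EXTERNAL, which stands in for the proposed MCE_PRIO_EXTERNAL) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical priority values for illustration; not the kernel's enum. */
enum { PRIO_LOWEST = 0, PRIO_EXTERNAL = 1, PRIO_EDAC = 2 };

/* Simplified stand-in for a notifier block (assumption, not the kernel type). */
struct notifier {
    const char *name;
    int priority;
    int calls;                  /* how many times this notifier has run */
    struct notifier *next;
};

/*
 * Insert n before the first entry with strictly lower priority, so
 * entries that share a priority stay in registration order.
 */
static void chain_register(struct notifier **head, struct notifier *n)
{
    assert(n != NULL);
    while (*head != NULL && n->priority <= (*head)->priority)
        head = &(*head)->next;
    n->next = *head;
    *head = n;
}

/* Run every notifier from highest to lowest priority; return how many ran. */
static int chain_call(struct notifier *head)
{
    int ran = 0;

    for (; head != NULL; head = head->next) {
        head->calls++;
        ran++;
    }
    return ran;
}
```

Under these assumptions, registering an NFIT-style notifier and a GPU driver notifier at the same PRIO_EXTERNAL level leaves both on the chain, and both run. Only their order relative to each other depends on registration order.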