On Fri, Sep 24, 2021 at 07:46:10PM +0000, Yazen Ghannam wrote:
> I agree with you in general. But this device isn't really a GPU. And
> users of this device seem to want to count *every* error, at least for
> now.

Aha, so something accelerator-y where they do general purpose
computation.

So what's the big picture here: they count all the errors and when they
reach a certain amount, they decide to replace the GPUs just in case? Or
wait until they become uncorrectable?

But then it doesn't matter because we will handle it properly by
excluding the VRAM range from further use.

Or do they wanna see *when* they had the correctable errors so that they
can restart the computation, just in case?

Dunno, it would help a lot if we had some RAS strategy for those
things...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette