Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

Borislav Petkov <bp@xxxxxxxxx> · Sat, 25 Sep 2021 13:20:57 +0200

On Fri, Sep 24, 2021 at 07:46:10PM +0000, Yazen Ghannam wrote:
> I agree with you in general. But this device isn't really a GPU. And
> users of this device seem to want to count *every* error, at least for
> now.

Aha, so something accelerator-y where they do general purpose computation.

So what's the big picture here: they count all the errors and when they
reach a certain amount, they decide to replace the GPUs just in case?

Or wait until they become uncorrectable? But then it doesn't matter
because we will handle it properly by excluding the VRAM range from
further use.

Or do they wanna see *when* they had the correctable errors so that they
can restart the computation, just in case.

Dunno, it would be a lot helpful if we had some RAS strategy for those
things...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette