Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

Mauro Carvalho Chehab <mchehab@xxxxxxxxxx> · Wed, 12 Jun 2019 08:42:13 -0300

Em Wed, 12 Jun 2019 13:00:39 +0200
Borislav Petkov <bp@xxxxxxxxx> escreveu:

> On Wed, Jun 12, 2019 at 07:42:42AM -0300, Mauro Carvalho Chehab wrote:
> > That's said, from the admin PoV, it makes sense to have a single
> > daemon that collect errors from all error sources and take the
> > needed actions.  
> 
> Doing recovery actions in userspace is too flaky. Daemon can get killed
> at any point in time and there are error types where you want to do
> recovery *before* you return to userspace.

Yeah, some actions would work a lot better at Kernelspace. Yet, some
actions would work a lot better if implemented on userspace.

For example, a server with multiple network interfaces may re-route
the traffic to a backup interface if the main one has too many errors.

This can easily be done on userspace.

> Yes, we do have different error reporting facilities but I still think
> that concentrating all the error information needed in order to do
> proper recovery action is the better approach here. And make that part
> of the kernel so that it is robust. Userspace can still configure it and
> so on.

If the error reporting facilities are for the same hardware "group"
(like the machine's memory controllers), I agree with you: it makes
sense to have a single driver. 

If they are for completely independent hardware then implementing
as separate drivers would work equally well, with the advantage of
making easier to maintain and make it generic enough to support
different vendors using the same IP block.

Thanks,
Mauro