On Fri, Oct 02, 2020 at 06:33:17PM +0100, James Morse wrote:
> > I think adding the CPU error collection to the kernel
> > has the following advantages,
> > 1. The CPU error collection and isolation would not be active if the
> > rasdaemon stopped running or not running on a machine.

Wasn't there this thing called systemd which promised that it would
restart daemons when they fail?

And even if it is not there, you can always do your own cronjob which
checks rasdaemon presence and restarts it if it has died and sends a
mail to the admin to check why it had died.

Everything else I've trimmed but James has put it a lot more eloquently
than me and I cannot agree more with what he says. Doing this in
userspace is better in every aspect you can think of.

The current CEC thing runs in the kernel because it has a completely
different purpose - to limit corrected error reports which turn into
very expensive support calls for errors which were corrected but people
simply don't get that they were corrected. Instead, they throw hands in
the air and go "OMG, my hardware is failing".

Where those are, as James says:

> These are corrected errors. Nothing has gone wrong.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette