On Sun, May 12, 2024 at 04:45:08PM -0700, Dan Williams wrote:
> Yes, my point though was that if it got deleted I doubt anyone would
> notice. rasdaemon explicitly does not check the return from
> open("daemon_active").

The intent was for userspace to open it and thus it'll increment
trace_count which then ras_userspace_consumers() reads...

> I am also curious about the history here. This "daemon_active" scheme is
> an awkward way to detect that something is consuming the tracepoint. It
> was added on v4.0, but Steven had added "tracepoint_enabled()" back in
> v3.17:
>
> 7c65bbc7dcfa tracing: Add trace_<tracepoint>_enabled() function

Ha, I usually talk to Rostedt for all things tracepoint when wondering
how we could use them for RAS purposes but I haven't this time, it
seems.

> So even if non-rasdaemon userspace was watching the extlog tracepoints
> they would not fire because ras_userspace_consumers() prevents it.
>
> I am finding it difficult to see why ras_userspace_consumers() needs to
> continue to be maintained.

Well, you still need some functionality which tests whether a userspace
daemon consumes RAS events. Whether it is something kludgy like now or
something which checks whether all RAS tracepoints have been enabled,
something's gotta be there.

> That would be odd since there is no ras_userspace_consumers() in the
> ACPI GHES path,

Probably because no one's using RAS daemon with GHES. I at least haven't
heard of anyone complaining about this yet...

> so it is already the case that you can get duplicate error information
> depending on which path triggers the error.
>
> Tracepoints are individually configurable.

Sure.

> From my perspective I want alignment between "firmware first" and "OS
> Native" events and I think any movement away from kernel log messages as
> a hardware error mechanism towards tracepoints is a good thing.

That has been the goal for a while now, yap. Anyone who parses the
kernel log for anything serious has been living under a rock for the
last decade at least. :)

> Recall that tracepoints can also be configured to emit to the kernel
> log, so that might be a way to keep legacy kernel log message parsing
> environments happy.

Ok.

> Would be great to hear from folks that have reasons for kernel log
> message error reporting to continue.

Right, from my experience so far, you never hear anything. :-\

So if we do anything, it should be something simple and which works for
almost everyone. With RAS, everyone does their own thing. And then
there's the firmware which claims that it can do better RAS but then
f*cks up on basic things like *actually* shipping a working EINJ or
whatever implementation. So at the end of the day it is, oh, we need our
drivers in the OS because we can't fix firmware. It is harder to fix it
than *hardware* :-P

> Uniformity of error response to "fatal" events, but that is mainly a
> PCIe error handling concern not CPU errors.

Sure, just make sure to keep it simple and generic.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
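
The two gating schemes discussed in this thread can be contrasted with a
small userspace model. This is only a sketch under stated assumptions, not
kernel code: every `model_*` name is hypothetical, the counter stands in for
the trace_count that ras_userspace_consumers() reads, and the per-tracepoint
flag stands in for what a trace_<tracepoint>_enabled() style check reports.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model. None of the model_* names are kernel
 * symbols; they only mirror the behaviour described in the thread.
 */

/* Stand-in for trace_count: bumped while "daemon_active" is held open. */
static atomic_int model_trace_count;

/* Models userspace opening "daemon_active". */
static void model_daemon_open(void)
{
	atomic_fetch_add(&model_trace_count, 1);
}

/* Models userspace closing "daemon_active". */
static void model_daemon_release(void)
{
	int old = atomic_fetch_sub(&model_trace_count, 1);

	assert(old > 0);	/* a release without a matching open is a bug */
}

/* Stand-in for ras_userspace_consumers(): just reads the counter. */
static int model_userspace_consumers(void)
{
	return atomic_load(&model_trace_count);
}

/* Stand-in for one individually configurable tracepoint. */
struct model_tracepoint {
	atomic_bool enabled;
};

static bool model_tracepoint_enabled(struct model_tracepoint *tp)
{
	return atomic_load(&tp->enabled);
}

/*
 * Scheme A: the event is reported through the tracepoint only if some
 * process holds "daemon_active" open AND the tracepoint is enabled.
 */
static bool model_emit_gated_on_count(struct model_tracepoint *tp)
{
	return model_userspace_consumers() > 0 &&
	       model_tracepoint_enabled(tp);
}

/* Scheme B: the event is reported whenever the tracepoint is enabled. */
static bool model_emit_gated_on_tracepoint(struct model_tracepoint *tp)
{
	return model_tracepoint_enabled(tp);
}
```

In this model, a consumer that enables the tracepoint without ever opening
"daemon_active" sees nothing under scheme A but does see events under
scheme B, which is the situation described above for non-rasdaemon
userspace.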