Borislav Petkov wrote:
> On Fri, May 10, 2024 at 03:12:36PM -0700, Dan Williams wrote:
> > I had asked Fabio to take a look at whether it made sense to continue
> > with the concept of ras_userspace_consumers() especially since it seems
> > limited to the EXTLOG case.
> >
> > In general I am finding that between OS Native and Firmware First error
> > reporting the logging approaches are inconsistent.
> >
> > As far I can see rasdaemon would not even notice is the "daemon_active"
> > debugfs file went away [1],
>
> It tells the kernel that it is consuming the error info from the
> tracepoints.

Yes, my point though was that if it got deleted I doubt anyone would
notice. rasdaemon explicitly does not check the return value from
open("daemon_active").

I am also curious about the history here. This "daemon_active" scheme is
an awkward way to detect that something is consuming the tracepoint. It
was added in v4.0, but Steven had added "tracepoint_enabled()" back in
v3.17:

    7c65bbc7dcfa tracing: Add trace_<tracepoint>_enabled() function

So even if non-rasdaemon userspace were watching the extlog tracepoints,
they would not fire unless something also held "daemon_active" open,
because ras_userspace_consumers() prevents it.

I am finding it difficult to see why ras_userspace_consumers() needs to
continue to be maintained.

> > and it should be the case that the tracepoints always fire whether
> > daemon_active is open or not.
> >
> > So I was expecting this removal to be a conversation starter on the
> > wider topic of error reporting consistency.
>
> Yeah, and then they'll come and say: ew, we're getting error duplicates
> - once logged in dmesg and once through the tracepoints.

That would be odd since there is no ras_userspace_consumers() in the
ACPI GHES path, so it is already the case that you can get duplicate
error information depending on which path triggers the error.
Tracepoints are individually configurable.
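
To make the unchecked open("daemon_active") point concrete, here is a
rough userspace sketch of what noticing a missing file could look like.
This is not rasdaemon code. The helper name open_consumer_marker() is
hypothetical, and the path assumes debugfs is mounted at
/sys/kernel/debug with the file under ras/.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Assumed location; debugfs may be mounted elsewhere on a given system. */
#define RAS_DAEMON_ACTIVE "/sys/kernel/debug/ras/daemon_active"

/*
 * Hypothetical helper (not part of rasdaemon): open the marker file and
 * report a failure instead of silently ignoring it.
 *
 * Returns a file descriptor that the caller keeps open for as long as it
 * consumes the tracepoints, or -1 if the file could not be opened.
 */
int open_consumer_marker(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));

	return fd;
}
```

A daemon would call open_consumer_marker(RAS_DAEMON_ACTIVE) once at
startup and decide whether a -1 return is fatal or only worth a warning.
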
> So just like with the other thread, we have to figure out what our
> scheme will be wrt hw error logging, agree on it and then make it
> consistent.

From my perspective I want alignment between "firmware first" and "OS
Native" events, and I think any movement away from kernel log messages
as a hardware error reporting mechanism towards tracepoints is a good
thing. Recall that tracepoints can also be configured to emit to the
kernel log, so that might be a way to keep legacy kernel log message
parsing environments happy.

> Do we want to have both? Should it be configurable? Probably...

Would be great to hear from folks that have a reason for kernel log
message error reporting to continue.

> Anything else...?

Uniformity of error response to "fatal" events, but that is mainly a
PCIe error handling concern, not CPU errors.