Re: [RFC PATCH v2 3/3] ACPI: extlog: Make print_extlog_rcd() log unconditionally

Borislav Petkov <bp@xxxxxxxxx> · Thu, 16 May 2024 11:57:14 +0200

On Sun, May 12, 2024 at 04:45:08PM -0700, Dan Williams wrote:
> Yes, my point though was that if it got deleted I doubt anyone would
> notice. rasdaemon explicitly does not check the return from
> open("daemon_active").

The intent was for userspace to open it and thus it'll increment
trace_count which then ras_userspace_consumers() reads...
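
As a rough illustration of that scheme, here is a userspace model, not the
kernel code. The names model_daemon_active_open(),
model_daemon_active_release() and model_userspace_consumers() are made up
for this sketch, and the counter only stands in for trace_count:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's trace_count; starts at zero. */
static atomic_int model_trace_count;

/* Models userspace opening "daemon_active": one more consumer. */
void model_daemon_active_open(void)
{
	atomic_fetch_add(&model_trace_count, 1);
}

/* Models the matching close of the file: one consumer fewer. */
void model_daemon_active_release(void)
{
	atomic_fetch_sub(&model_trace_count, 1);
}

/* Models what ras_userspace_consumers() answers: is anyone holding the file open? */
bool model_userspace_consumers(void)
{
	return atomic_load(&model_trace_count) > 0;
}
```

The point of the sketch is that the answer depends only on whether the file
is held open, not on whether any tracepoint is actually enabled.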

> I am also curious about the history here. This "daemon_active" scheme is
> an awkward way to detect that something is consuming the tracepoint. It
> was added in v4.0, but Steven had added "tracepoint_enabled()" back in
> v3.17:
> 
> 7c65bbc7dcfa tracing: Add trace_<tracepoint>_enabled() function

Ha, I usually talk to Rostedt for all things tracepoint when wondering
how we could use them for RAS purposes but I haven't this time, it
seems.

> So even if non-rasdaemon userspace was watching the extlog tracepoints
> they would not fire because ras_userspace_consumers() prevents it.
>
> I am finding it difficult to see why ras_userspace_consumers() needs to
> continue to be maintained.

Well, you still need some functionality which tests whether a userspace
daemon consumes RAS events. Whether it is something kludgy like now or
something which checks whether all RAS tracepoints have been enabled,
something's gotta be there.
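
A minimal sketch of the second variant, again as a userspace model under
stated assumptions: struct ras_tp_model and all_ras_tracepoints_enabled()
are hypothetical, and the enabled flag stands in for whatever per-tracepoint
state the kernel would really consult:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one RAS tracepoint and its enable state. */
struct ras_tp_model {
	const char *name;
	bool enabled;
};

/*
 * Returns true only if there is at least one tracepoint and every
 * tracepoint in the array is enabled.
 */
bool all_ras_tracepoints_enabled(const struct ras_tp_model *tps, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!tps[i].enabled)
			return false;
	}
	return n > 0;
}
```

Unlike the open-count approach, this one follows the tracepoints' own state,
so it would also see a consumer that is not rasdaemon.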

> That would be odd since there is no ras_userspace_consumers() in the
> ACPI GHES path,

Probably because no one's using RAS daemon with GHES. I at least haven't
heard of anyone complaining about this yet...

> so it is already the case that you can get duplicate error information
> depending on which path triggers the error.
>
> Tracepoints are individually configurable.

Sure.

> From my perspective I want alignment between "firmware first" and "OS
> Native" events and I think any movement away from kernel log messages as
> a hardware error mechanism towards tracepoints is a good thing.

That has been the goal for a while now, yap.

Anyone who parses the kernel log for anything serious has been living
under a rock in the last decade at least. :)

> Recall that tracepoints can also be configured to emit to the kernel
> log, so that might be a way to keep legacy kernel log message parsing
> environments happy.

Ok.

> Would be great to hear from folks that have a reason for kernel log
> message error reporting to continue.

Right, from my experience so far, you never hear anything. :-\

So if we do anything, it should be something simple and which works for
almost everyone.

With RAS, everyone does their own thing. And then there's the firmware
which claims that it can do better RAS but then f*cks up on basic things
like *actually* shipping a working EINJ or whatever implementation.

So at the end of the day it is, oh, we need our drivers in the OS
because we can't fix firmware. It is harder to fix it than *hardware*
:-P

> Uniformity of error response to "fatal" events, but that is mainly a
> PCIe error handling concern not CPU errors.

Sure, just make sure to keep it simple and generic.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette