On Wed, 13 Jan 2021 10:16:55 +0100 Alexander Potapenko <glider@xxxxxxxxxx> wrote:

> With the introduction of various production error-detection tools, such as
> MTE-based KASAN and KFENCE, the need arises to efficiently notify the
> userspace OS components about kernel errors. Currently, no facility exists
> to notify userspace about a kernel error from such bug-detection tools.
> The problem is obviously not restricted to the above bug detection tools,
> and applies to any error reporting mechanism that does not panic the
> kernel; this series, however, will only add support for KASAN and KFENCE
> reporting.
>
> All such error reports appear in the kernel log. But, when such errors
> occur, userspace would normally need to read the entire kernel log and
> parse the relevant errors. This is error prone and inefficient, as
> userspace needs to continuously monitor the kernel log for error messages.
> On certain devices, this is unfortunately not acceptable. Therefore, we
> need to revisit how reports are propagated to userspace.
>
> The library added, error_report_notify (CONFIG_ERROR_REPORT_NOTIFY),
> solves the above by using the error_report_start/error_report_end tracing
> events and exposing the last report and the total report count to the
> userspace via /sys/kernel/error_report/last_report and
> /sys/kernel/error_report/report_count.
>
> Userspace apps can call poll(POLLPRI) on those files to get notified about
> the new reports without having to watch dmesg in a loop.

It would be nice to see some user-facing documentation for this, under
Documentation/.  How to use it, what the shortcomings are, etc.

For instance...  what happens when userspace is slow reading
/sys/kernel/error_report/last_report?  Does that file buffer multiple
reports?  Does the previous one get overwritten?  etc.  Words on how
this obvious issue is handled...
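For illustration only, here is a minimal sketch of what a userspace consumer of the two proposed files might look like. It assumes the usual sysfs poll pattern (read the attribute once, poll() for POLLPRI/POLLERR, then re-read from offset 0), that report_count holds a decimal integer, and that last_report keeps only the newest report. reread_attr(), reports_missed() and watch_error_reports() are hypothetical helpers, not part of the series.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* File names as proposed in the cover letter. */
#define LAST_REPORT_PATH	"/sys/kernel/error_report/last_report"
#define REPORT_COUNT_PATH	"/sys/kernel/error_report/report_count"

/* Re-read a sysfs attribute from offset 0 and NUL-terminate it. */
static ssize_t reread_attr(int fd, char *buf, size_t len)
{
	ssize_t n;

	if (len == 0 || lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, len - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}

/*
 * Reports that were counted but never observable in last_report,
 * assuming (hypothetically) last_report holds only the newest report.
 */
static unsigned long reports_missed(unsigned long prev, unsigned long cur)
{
	return cur > prev + 1 ? cur - prev - 1 : 0;
}

/* Print each new report until an error occurs; always returns -1. */
int watch_error_reports(void)
{
	/* Sized above PAGE_SIZE on common configs (4K..64K pages). */
	static char report[65536 + 1];
	char count_buf[32];
	unsigned long prev, cur, missed;
	struct pollfd pfd;
	int count_fd = open(REPORT_COUNT_PATH, O_RDONLY);
	int report_fd = open(LAST_REPORT_PATH, O_RDONLY);

	if (count_fd < 0 || report_fd < 0) {
		perror("open");
		goto out;
	}
	/* Read once so that poll() waits for the next change notification. */
	if (reread_attr(count_fd, count_buf, sizeof(count_buf)) < 0)
		goto out;
	prev = strtoul(count_buf, NULL, 10);

	pfd.fd = count_fd;
	pfd.events = POLLPRI | POLLERR;
	for (;;) {
		if (poll(&pfd, 1, -1) < 0) {
			if (errno == EINTR)
				continue;
			perror("poll");
			goto out;
		}
		/*
		 * The two files are read separately, so a report arriving in
		 * between can pair this count with a newer report.
		 */
		if (reread_attr(count_fd, count_buf, sizeof(count_buf)) < 0 ||
		    reread_attr(report_fd, report, sizeof(report)) < 0)
			goto out;
		cur = strtoul(count_buf, NULL, 10);
		if (cur == prev)
			continue;	/* no new report since the last wakeup */
		missed = reports_missed(prev, cur);
		if (missed)
			fprintf(stderr, "missed %lu report(s)\n", missed);
		printf("report #%lu:\n%s\n", cur, report);
		prev = cur;
	}
out:
	if (count_fd >= 0)
		close(count_fd);
	if (report_fd >= 0)
		close(report_fd);
	return -1;
}
```

Polling report_count rather than last_report gives the consumer a sequence number, so a gap between successive values at least reveals that reports were lost, even if their contents are gone. The unsynchronized pair of reads is exactly the kind of shortcoming the documentation would need to spell out.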
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -209,6 +209,20 @@ config DEBUG_BUGVERBOSE
>  	  of the BUG call as well as the EIP and oops trace.  This aids
>  	  debugging but costs about 70-100K of memory.
>  
> +config ERROR_REPORT_NOTIFY
> +	bool "Expose memory error reports to the userspace"

There's really nothing "memory" specific about this?  Any kernel
subsystem could use it?

> +	depends on TRACING
> +	help
> +	  When enabled, captures error reports from debugging tools (such as
> +	  KFENCE or KASAN) using console tracing, and exposes reports in
> +	  /sys/kernel/error_report/: the file last_report contains the last
> +	  report (with maximum report length of PAGE_SIZE), and report_count,
> +	  the total report count.
> +
> +	  Userspace programs can call poll(POLLPRI) on those files to get
> +	  notified about the new reports without having to watch dmesg in a
> +	  loop.

So we have a whole new way of getting debug info out of the kernel.  I
fear this will become a monster.  And anticipating that, we should make
darn sure that the interface is right, and is extensible.