On Tuesday, August 6, 2024 9:56:24 PM GMT+2 Dan Williams wrote: > Fabio M. De Francesco wrote: > > Currently, extlog_print() (ELOG) only reports CPER PCIe section (UEFI > > v2.10, Appendix N.2.7) to the kernel log via print_extlog_rcd(). > > I think the critical detail is is that print_extlog_rcd() is only > triggered when ras_userspace_consumers() returns true. The observation > is that ras_userspace_consumers() hides information from the trace path > when the intended purpose of it was to hide duplicate emissions to the > kernel log when userspace is watching the tracepoints. > > Setting aside whether ras_userspace_consumers() is still a good idea or > not, it is obvious that this patch as is may surprise environments that > start seeing kernel error logs where the kernel was silent before. > > I think the path of least surprise would be to make sure that > pci_print_aer() optionally skips emitting to the kernel log when not > needed wanted. Sorry for replying so late... I'm not entirely sure that users would not prefer to be surprised by _finally_ seeing kernel error logs for failing PCIe components. I suspect that users might have been confused by not seeing any output. > So perhaps first do a lead-in patch to optionally quiet the print > messages in pci_print_aer() and then pass in KERN_DEBUG from the > extlog_print() path. Then we can decide later what to do about > ras_userspace_consumers(). Anyway, I'll do it. > > the similar ghes_do_proc() (GHES) prints to kernel log and calls > > pci_print_aer() to report via the ftrace infrastructure. > > > > Add support to report the CPER PCIe Error section also via the ftrace > > infrastructure by calling pci_print_aer() to make ELOG act consistently > > with GHES. > > You might also want to explain a bit about the motivation for this which > is that I/O Machine Check Arcitecture events may signal failing PCIe > components or links. The AER event contains details on what was > happening on the wire when the error was signaled. Yes, I agree. > > > > Cc: Dan Williams <dan.j.williams@xxxxxxxxx> > > Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@xxxxxxxxxxxxxxx> > > --- > > drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++ > > drivers/pci/pcie/aer.c | 2 +- > > include/linux/aer.h | 13 +++++++++++-- > > 3 files changed, 42 insertions(+), 3 deletions(-) > > > > [...] > > > > + pci_print_aer(pdev, aer_severity, aer); > > ...per above this would become: > > pci_print_aer(KERN_DEBUG, pdev, aer_severity, aer); > > [..] > > Rest of the changes look good to me. I need to be sure that I understood... void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity, struct aer_capability_regs *aer) { [...] if (printk_get_level(level) <= console_loglevel) { pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask); __aer_print_error(dev, &info); pci_err(dev, "aer_layer=%s, aer_agent=%s\n", aer_error_layer[layer], aer_agent_string[agent]); if (aer_severity != AER_CORRECTABLE) pci_err(dev, "aer_uncor_severity: 0x%08x\n", aer->uncor_severity); if (tlp_header_valid) __print_tlp_header(dev, &aer->header_log); } [...] } It would require changing a couple of call sites, like in aer_recover_work_func(): pci_print_aer(KERN_ERR, pdev, entry.severity, entry.regs); Would you please confirm that the code shown above is what you asked for? Thanks, Fabio