Re: [PATCH 2/2] ACPI: extlog: Trace CPER PCI Express Error Section

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tuesday, August 6, 2024 9:56:24 PM GMT+2 Dan Williams wrote:
> Fabio M. De Francesco wrote:
> > Currently, extlog_print() (ELOG) only reports CPER PCIe section (UEFI
> > v2.10, Appendix N.2.7) to the kernel log via print_extlog_rcd().
> 
> I think the critical detail is is that print_extlog_rcd() is only
> triggered when ras_userspace_consumers() returns true. The observation
> is that ras_userspace_consumers() hides information from the trace path
> when the intended purpose of it was to hide duplicate emissions to the
> kernel log when userspace is watching the tracepoints.
>
> Setting aside whether ras_userspace_consumers() is still a good idea or
> not, it is obvious that this patch as is may surprise environments that
> start seeing kernel error logs where the kernel was silent before.
>
> I think the path of least surprise would be to make sure that
> pci_print_aer() optionally skips emitting to the kernel log when not
> needed wanted.

Sorry for replying so late...

I'm not entirely sure that users would not prefer to be surprised by 
_finally_ seeing kernel error logs for failing PCIe components. I suspect 
that users might have been confused by not seeing any output.
 
> So perhaps first do a lead-in patch to optionally quiet the print
> messages in pci_print_aer() and then pass in KERN_DEBUG from the
> extlog_print() path. Then we can decide later what to do about
> ras_userspace_consumers().

Anyway, I'll do it.

> > the similar ghes_do_proc() (GHES) prints to kernel log and calls
> > pci_print_aer() to report via the ftrace infrastructure.
> > 
> > Add support to report the CPER PCIe Error section also via the ftrace
> > infrastructure by calling pci_print_aer() to make ELOG act consistently
> > with GHES.
> 
> You might also want to explain a bit about the motivation for this which
> is that I/O Machine Check Arcitecture events may signal failing PCIe
> components or links. The AER event contains details on what was
> happening on the wire when the error was signaled.

Yes, I agree.

> > 
> > Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> > Signed-off-by: Fabio M. De Francesco 
<fabio.m.de.francesco@xxxxxxxxxxxxxxx>
> > ---
> >  drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++
> >  drivers/pci/pcie/aer.c     |  2 +-
> >  include/linux/aer.h        | 13 +++++++++++--
> >  3 files changed, 42 insertions(+), 3 deletions(-)
> > 
> > [...]
> >
> > +		pci_print_aer(pdev, aer_severity, aer);
> 
> ...per above this would become:
> 
>     pci_print_aer(KERN_DEBUG, pdev, aer_severity, aer);
> 
> [..]
> 
> Rest of the changes look good to me.

I need to be sure that I understood...

void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
                   struct aer_capability_regs *aer)
{
        [...]

        if (printk_get_level(level) <= console_loglevel) {
                pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
                        status, mask);
                __aer_print_error(dev, &info);
                pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
                        aer_error_layer[layer], aer_agent_string[agent]);

                if (aer_severity != AER_CORRECTABLE)
                        pci_err(dev, "aer_uncor_severity: 0x%08x\n",
                                aer->uncor_severity);

                if (tlp_header_valid)
                        __print_tlp_header(dev, &aer->header_log);
        }

        [...]
}	

It would require changing a couple of call sites, like in    
aer_recover_work_func():

pci_print_aer(KERN_ERR, pdev, entry.severity, entry.regs);
 
Would you please confirm that the code shown above is what
you asked for?

Thanks,

Fabio







[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux