Re: [PATCH 2/2] ACPI: extlog: Trace CPER PCI Express Error Section

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Dan Williams wrote:
> [ add Boris ]

[ actually add Boris ]

Boris, see below, thoughts on deprecating acpi_extlog...

> Bjorn Helgaas wrote:
> > On Mon, May 27, 2024 at 04:43:41PM +0200, Fabio M. De Francesco wrote:
> > > Currently, extlog_print() (ELOG) only reports CPER PCIe section (UEFI
> > > v2.10, Appendix N.2.7) to the kernel log via print_extlog_rcd(). Instead,
> > > the similar ghes_do_proc() (GHES) prints to kernel log and calls
> > > pci_print_aer() to report via the ftrace infrastructure.
> > > 
> > > Add support to report the CPER PCIe Error section also via the ftrace
> > > infrastructure by calling pci_print_aer() to make ELOG act consistently
> > > with GHES.
> > > 
> > > Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> > > Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@xxxxxxxxxxxxxxx>
> > > ---
> > >  drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++
> > >  drivers/pci/pcie/aer.c     |  2 +-
> > >  include/linux/aer.h        | 13 +++++++++++--
> > >  3 files changed, 42 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> > > index e025ae390737..007ce96f8672 100644
> > > --- a/drivers/acpi/acpi_extlog.c
> > > +++ b/drivers/acpi/acpi_extlog.c
> > > @@ -131,6 +131,32 @@ static int print_extlog_rcd(const char *pfx,
> > >  	return 1;
> > >  }
> > >  
> > > +static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
> > > +			      int severity)
> > > +{
> > > +	struct aer_capability_regs *aer;
> > > +	struct pci_dev *pdev;
> > > +	unsigned int devfn;
> > > +	unsigned int bus;
> > > +	int aer_severity;
> > > +	int domain;
> > > +
> > > +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> > > +	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> > > +		aer_severity = cper_severity_to_aer(severity);
> > > +		aer = (struct aer_capability_regs *)pcie_err->aer_info;
> > > +		domain = pcie_err->device_id.segment;
> > > +		bus = pcie_err->device_id.bus;
> > > +		devfn = PCI_DEVFN(pcie_err->device_id.device,
> > > +				  pcie_err->device_id.function);
> > > +		pdev = pci_get_domain_bus_and_slot(domain, bus, devfn);
> > > +		if (!pdev)
> > > +			return;
> > > +		pci_print_aer(pdev, aer_severity, aer);
> > > +		pci_dev_put(pdev);
> > > +	}
> > 
> > I'm 100% in favor of making error reporting work and look the same
> > across GHES and ELOG.  But I do have to gripe a bit...
> > 
> > It's already unfortunate that GHES and the native AER handling are
> > separate paths that lead to the same place (__aer_print_error()).
> > 
> > I'm sorry that we need to add a third path that again does
> > fundamentally the same thing.  The fact that they're separate means
> > all the design, reviewing, testing, and maintenance effort is diluted,
> > and error handling always gets too little love in the first place.
> > I think this is a recipe for confusion.
> > 
> >   ghes_do_proc                                        # GHES
> >     apei_estatus_for_each_section
> >       ...
> >       if (guid_equal(sec_type, &CPER_SEC_PCIE))
> >         ghes_handle_aer
> >           cper_severity_to_aer
> >           aer_recover_queue
> >             kfifo_in_spinlocked(&aer_recover_ring)    # add to queue
> >           aer_recover_work_func                       # another thread
> >             kfifo_get(&aer_recover_ring)              # remove from queue
> >             pci_print_aer
> >               __aer_print_error         <---
> > 
> >   aer_irq                                             # native AER
> >     kfifo_put(&aer_fifo)                              # add to queue
> >   aer_isr                                             # another thread
> >     kfifo_get(&aer_fifo)                              # remove from queue
> >     ...
> >     aer_isr_one_error
> >       aer_process_err_devices
> >         aer_print_error
> >           __aer_print_error             <---
> > 
> >   extlog_print                                        # extlog (x86 only)
> >     apei_estatus_for_each_section
> >       ...
> >       if (guid_equal(sec_type, &CPER_SEC_PCIE))
> >         extlog_print_pcie
> >           cper_severity_to_aer
> >           pci_get_domain_bus_and_slot
> >           pci_print_aer
> >             __aer_print_error           <---
> > 
> > And we also have CXL paths that lead to __aer_print_error(), although
> > it seems like they at least start in the native AER (and maybe GHES?)
> > path and branch out somewhere.  My head is spinning.
> > 
> > Do I *object* to this patch?  No, not really; it's a trivial change in
> > drivers/pci/, and Rafael can add my
> > 
> >   Acked-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> > 
> > as needed.  But I am afraid we're making ourselves a maintenance
> > headache.
> 
> To be honest, I am too. Upon discovering that extlog_print() behaves
> differently than ghes_do_proc(), I had the snarky thought "great, can we
> now just go ahead and deprecate the extlog path, it's just a source of
> maintenance pain.".
> 
> So *if*we keep acpi_extlog it then I definitely think it should be
> consistent with other CPER handlers (needs this patch). But, I am also
> open to entertaining "deprecate it".
> 
> To me, the fact that ras_userspace_consumers() is only honored for
> acpi_extlog is clear evidence that the kernel has already painted itself
> into a confusing user ABI corner and maybe the proper path forward at
> this point is to cut acpi_extlog loose.




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux