Re: [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Thu, 6 Mar 2025 18:02:20 -0600

On Wed, Mar 05, 2025 at 05:32:45PM -0800, Jon Pan-Doh wrote:
> > On Tue, Mar 04, 2025 at 05:04:21PM -0800, Jon Pan-Doh wrote:
> > > Would a log suffice in that case (i.e. when
> > > aer_get_device_error_info() returns 0)? Something along the lines of
> > > "{device} is not accessible while processing (un)correctable error"
> 
> What are your thoughts on this? It adds the pcie port log in the
> edge case described (with no loss of info) and doesn't require
> changes to current ratelimit logic. Something like this (with more
> fields filled in of course):
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 21cdf590b25e..bdfc7e8d6f0f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1253,6 +1253,8 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>         for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>                 if (aer_get_device_error_info(e_info->dev[i], e_info))
>                         aer_print_error(e_info->dev[i], e_info);
> +               else
> +                       pci_err(e_info->dev[i], "{device} is not accessible while processing (un)correctable error\n");
>         }
>         for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>                 if (aer_get_device_error_info(e_info->dev[i], e_info))

Maybe, although I think consistency is very important: we'll always
have Root Port info but won't always have Endpoint info.  So dropping
the Root Port message seems backwards when it's the Endpoint part
that's "optional".

One thing I do like about the current messages is that they associate
information with the device that is the source of the information.
When I first looked at how AER works, I found it very confusing to
work out which device each piece of information came from.

E.g., the "pcieport ... Correctable error" message means the Root Port
received an ERR_COR and generated an interrupt, and the error class
and error source came from the Root Port AER Capability.  Similarly,
the "e1000e ... error status" message contains information read from
the Endpoint AER Capability.
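
To make the Root Port side of that split concrete, here is a rough
userspace sketch of how the error class and error source can be decoded
from the Root Port's AER registers.  The two bit definitions match
include/uapi/linux/pci_regs.h; the struct and helper names
(source_id, cor_source, uncor_source, decode_requester_id) are
hypothetical and exist only for this illustration, they are not kernel
APIs.

```c
#include <stdint.h>
#include <stdio.h>

/* Root Error Status bits (values as in include/uapi/linux/pci_regs.h) */
#define PCI_ERR_ROOT_COR_RCV   0x00000001 /* ERR_COR received */
#define PCI_ERR_ROOT_UNCOR_RCV 0x00000004 /* ERR_FATAL/NONFATAL received */

/* Hypothetical helper type: a decoded Requester ID (bus/device/function) */
struct source_id {
	unsigned int bus;
	unsigned int dev;
	unsigned int fn;
};

/* Error Source Identification: bits 15:0 hold the ERR_COR source */
static uint16_t cor_source(uint32_t err_src)
{
	return (uint16_t)(err_src & 0xffff);
}

/* Error Source Identification: bits 31:16 hold the ERR_FATAL/NONFATAL source */
static uint16_t uncor_source(uint32_t err_src)
{
	return (uint16_t)(err_src >> 16);
}

/* Requester ID layout: bus in bits 15:8, device in 7:3, function in 2:0 */
static struct source_id decode_requester_id(uint16_t id)
{
	struct source_id s;

	s.bus = id >> 8;
	s.dev = (id >> 3) & 0x1f;
	s.fn = id & 0x7;
	return s;
}

/* Print which class of error the Root Port saw and which device sent it */
static void print_root_port_view(uint32_t root_status, uint32_t err_src)
{
	if (root_status & PCI_ERR_ROOT_COR_RCV) {
		struct source_id s = decode_requester_id(cor_source(err_src));

		printf("Correctable error from %02x:%02x.%x\n",
		       s.bus, s.dev, s.fn);
	}
	if (root_status & PCI_ERR_ROOT_UNCOR_RCV) {
		struct source_id s = decode_requester_id(uncor_source(err_src));

		printf("Uncorrectable error from %02x:%02x.%x\n",
		       s.bus, s.dev, s.fn);
	}
}
```

Everything above comes from the Root Port alone.  The detailed error
status bits still have to be read from the AER Capability of the device
that the source ID names, which is the part that can fail when the
Endpoint is not accessible.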

I do think the existing messages are WAY too verbose.  I would love to
make them more concise, and I think the important endpoint info could
probably be squeezed into a single line, although obviously TLP header
logs would be too much for that.
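
As a purely illustrative sketch of what a single-line endpoint message
could look like, the following userspace snippet formats the device,
severity, layer, status, and mask into one line.  The struct
ep_aer_summary, the function format_ep_line(), and the output format
itself are assumptions made up for this example; none of them is an
existing or proposed kernel interface, and the TLP header log is
deliberately left out.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical summary of what was read from the Endpoint AER Capability */
struct ep_aer_summary {
	const char *bdf;      /* device name, e.g. "0000:02:00.0" */
	const char *severity; /* e.g. "Correctable" */
	const char *layer;    /* e.g. "Physical Layer" */
	uint32_t status;      /* raw error status register value */
	uint32_t mask;        /* raw error mask register value */
};

/*
 * Format the summary into buf as one line (no trailing newline).
 * Returns the snprintf() result: the length the full line needs,
 * which is >= len if the output was truncated.
 */
static int format_ep_line(char *buf, size_t len,
			  const struct ep_aer_summary *s)
{
	return snprintf(buf, len,
			"%s: AER: %s error (%s), status=0x%08" PRIx32
			" mask=0x%08" PRIx32,
			s->bdf, s->severity, s->layer, s->status, s->mask);
}
```

A caller would fill in the struct from the values it already read and
pass a stack buffer; checking the return value against the buffer size
shows whether the line was truncated.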

Bjorn