On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote:
> > From: Rajat Khandelwal <rajat.khandelwal@xxxxxxxxxxxxxxx>
> >
> > There are many instances where correctable errors tend to inundate
> > the message buffer. We observe such instances during thunderbolt PCIe
> > tunneling.
> >
> > It's true that they are mitigated by the hardware and are non-fatal
> > but we shouldn't be spamming the logs with such correctable errors as it
> > confuses other kernel developers less familiar with PCI errors, support
> > staff, and users who happen to look at the logs, hence rate limit them.
> >
> > A typical example log inside an HP TBT4 dock:
> > [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
> > [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> > [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000
> > [54912.661211] igc 0000:2b:00.0: [ 8] Rollover
> > [54912.661219] igc 0000:2b:00.0: [12] Timeout
> > [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
> > [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> > [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
> > [54982.838817] igc 0000:2b:00.0: [12] Timeout
>
> The timestamps don't contribute to understanding the problem, so we
> can omit them.

Ok.

> > This gets repeated continuously, thus inundating the buffer.
> >
> > Signed-off-by: Rajat Khandelwal <rajat.khandelwal@xxxxxxxxxxxxxxx>
> > Signed-off-by: Grant Grundler <grundler@xxxxxxxxxxxx>
> > ---
> >  drivers/pci/pcie/aer.c | 42 ++++++++++++++++++++++++++++--------------
> >  1 file changed, 28 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index cb6b96233967..b592cea8bffe 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -706,8 +706,8 @@ static void __aer_print_error(struct pci_dev *dev,
> >  			errmsg = "Unknown Error Bit";
> >
> >  		if (info->severity == AER_CORRECTABLE)
> > -			pci_info(dev, " [%2d] %-22s%s\n", i, errmsg,
> > -				info->first_error == i ? " (First)" : "");
> > +			pci_info_ratelimited(dev, " [%2d] %-22s%s\n", i, errmsg,
> > +				info->first_error == i ? " (First)" : "");
>
> I don't think this is going to reliably work the way we want. We have
> a bunch of pci_info_ratelimited() calls, and each caller has its own
> ratelimit_state data. Unless we call pci_info_ratelimited() exactly
> the same number of times for each error, the ratelimit counters will
> get out of sync and we'll end up printing fragments from error A mixed
> with fragments from error B.

Ok - what I'm reading between the lines here is that the output should
be emitted in one step, not by multiple pci_info_ratelimited() calls.
If the code built an output string (using snprintf()), and then called
pci_info_ratelimited() exactly once at the bottom, would that be
sufficient?

> I think we need to explicitly manage the ratelimiting ourselves,
> similar to print_hmi_event_info() or print_extlog_rcd(). Then we can
> have a *single* ratelimit_state, and we can check it once to determine
> whether to log this correctable error.

Is the rate limiting per call location or per device? From above, I
understood rate limiting is "per call location". If the code only has
one call location, it should achieve the same goal, right?
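
For illustration only, here is a userspace sketch of the "single
ratelimit_state, checked once per error" idea from Bjorn's comment
above. Everything in it is an assumption made for the example:
struct rl_state, rl_allow() and log_correctable() are hypothetical
stand-ins, not the kernel's struct ratelimit_state / __ratelimit()
API, the printed strings are placeholders, and time is a plain tick
counter passed in by the caller. The point is only that the decision
to print is made once, so every line of a report is either emitted or
suppressed together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a rate limit state. */
struct rl_state {
	long interval;	/* window length, in caller-defined ticks */
	int burst;	/* reports allowed per window */
	int printed;	/* reports allowed so far in the current window */
	long begin;	/* tick at which the current window started */
};

/* Return true if one more report may be logged at tick 'now'. */
static bool rl_allow(struct rl_state *rs, long now)
{
	assert(rs->interval > 0 && rs->burst > 0);

	/* Start a fresh window once the old one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}

	if (rs->printed >= rs->burst)
		return false;

	rs->printed++;
	return true;
}

/*
 * Log one correctable error made of a header line, a status line and
 * 'nr_bits' per-bit lines. The rate limit is consulted exactly once,
 * so a report is never printed in fragments. Returns the number of
 * lines emitted (0 if the whole report was suppressed).
 */
static int log_correctable(struct rl_state *rs, long now, int nr_bits)
{
	int lines = 0;
	int i;

	if (!rl_allow(rs, now))
		return 0;

	printf("header line (placeholder)\n");
	lines++;
	printf("status/mask line (placeholder)\n");
	lines++;

	for (i = 0; i < nr_bits; i++) {
		printf("error bit line %d (placeholder)\n", i);
		lines++;
	}

	return lines;
}
```

With interval = 5 and burst = 2, two reports inside the same window
are printed in full regardless of how many per-bit lines each has, a
third is dropped in full, and printing resumes once the window
expires. Because there is a single gate, it does not matter whether
the lines are then emitted by one call or by several.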

cheers,
grant

>
> >  		else
> >  			pci_err(dev, " [%2d] %-22s%s\n", i, errmsg,
> >  				info->first_error == i ? " (First)" : "");
> > @@ -719,7 +719,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >  {
> >  	int layer, agent;
> >  	int id = ((dev->bus->number << 8) | dev->devfn);
> > -	const char *level;
> >
> >  	if (!info->status) {
> >  		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> > @@ -730,14 +729,21 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> >  	agent = AER_GET_AGENT(info->severity, info->status);
> >
> > -	level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
> > +	if (info->severity == AER_CORRECTABLE) {
> > +		pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> > +				     aer_error_severity_string[info->severity],
> > +				     aer_error_layer[layer], aer_agent_string[agent]);
> >
> > -	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> > -		   aer_error_severity_string[info->severity],
> > -		   aer_error_layer[layer], aer_agent_string[agent]);
> > +		pci_info_ratelimited(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
> > +				     dev->vendor, dev->device, info->status, info->mask);
> > +	} else {
> > +		pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> > +			aer_error_severity_string[info->severity],
> > +			aer_error_layer[layer], aer_agent_string[agent]);
> >
> > -	pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
> > -		   dev->vendor, dev->device, info->status, info->mask);
> > +		pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
> > +			dev->vendor, dev->device, info->status, info->mask);
> > +	}
> >
> >  	__aer_print_error(dev, info);
> >
> > @@ -757,11 +763,19 @@ static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> >  	u8 bus = info->id >> 8;
> >  	u8 devfn = info->id & 0xff;
> >
> > -	pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
> > -		 info->multi_error_valid ? "Multiple " : "",
> > -		 aer_error_severity_string[info->severity],
> > -		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> > -		 PCI_FUNC(devfn));
> > +	if (info->severity == AER_CORRECTABLE)
> > +		pci_info_ratelimited(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
> > +				     info->multi_error_valid ? "Multiple " : "",
> > +				     aer_error_severity_string[info->severity],
> > +				     pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> > +				     PCI_FUNC(devfn));
> > +	else
> > +		pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
> > +			 info->multi_error_valid ? "Multiple " : "",
> > +			 aer_error_severity_string[info->severity],
> > +			 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> > +			 PCI_FUNC(devfn));
> > +
> >  }
> >
> >  #ifdef CONFIG_ACPI_APEI_PCIEAER
> > --
> > 2.40.0.rc1.284.g88254d51c5-goog
> >