On 15/01/2025 08:42, Jon Pan-Doh wrote:
(...)
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 025c50b0f293..1db70ae87f52 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
@@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
}
}
+static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info)
+{
+ const char **strings;
+ const char *errmsg;
+ u16 aer_offset = dev->aer_cap;
+ u16 mask_reg_offset;
+ u32 mask;
+ unsigned long status = info->status;
+ int i;
+
+ if (info->severity == AER_CORRECTABLE) {
+ strings = aer_correctable_error_string;
+ mask_reg_offset = PCI_ERR_COR_MASK;
+ } else {
+ strings = aer_uncorrectable_error_string;
+ mask_reg_offset = PCI_ERR_UNCOR_MASK;
+ }
+
+ pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask);
+ mask |= status;
+ pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask);
+
+ pci_warn(dev, "%s error(s) masked due to rate-limiting:",
+ aer_error_severity_string[info->severity]);
+ for_each_set_bit(i, &status, 32) {
+ errmsg = strings[i];
+ if (!errmsg)
+ errmsg = "Unknown Error Bit";
+
+ pci_warn(dev, " [%2d] %-22s\n", i, errmsg);
+ }
+}
+
My approach was more heavy-handed :)
To confirm that I understand the flow -- when we're processing
aer_err_info, that potentially carries a couple of errors, and we hit a
ratelimit, we mask the error bits in Error Status Register and print
a warning. After doing so, we won't see these types of errors reported
again. What if some time passes (let's say, 2 mins), and we hit a
condition that would normally generate an error but it's now masked? Are
we fine with missing it? I think we should be informed about
Uncorrectable errors as much as possible, as they indicate Link
integrity issues.
- if (!__ratelimit(ratelimit))
+ if (!__ratelimit(irq_ratelimit)) {
+ mask_reported_error(dev, info);
+ return;
+ }
+ if (!__ratelimit(log_ratelimit))
return;
Nit: add a line between these two ifs
All the best,
Karolina