Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs

Karolina Stolarek <karolina.stolarek@xxxxxxxxxx> · Thu, 16 Jan 2025 13:02:06 +0100





On 15/01/2025 08:42, Jon Pan-Doh wrote:
(...)
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 025c50b0f293..1db70ae87f52 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
@@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
  	}
  }
  
+static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info)
+{
+	const char **strings;
+	const char *errmsg;
+	u16 aer_offset = dev->aer_cap;
+	u16 mask_reg_offset;
+	u32 mask;
+	unsigned long status = info->status;
+	int i;
+
+	if (info->severity == AER_CORRECTABLE) {
+		strings = aer_correctable_error_string;
+		mask_reg_offset = PCI_ERR_COR_MASK;
+	} else {
+		strings = aer_uncorrectable_error_string;
+		mask_reg_offset = PCI_ERR_UNCOR_MASK;
+	}
+
+	pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask);
+	mask |= status;
+	pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask);
+
+	pci_warn(dev, "%s error(s) masked due to rate-limiting:",
+		 aer_error_severity_string[info->severity]);
+	for_each_set_bit(i, &status, 32) {
+		errmsg = strings[i];
+		if (!errmsg)
+			errmsg = "Unknown Error Bit";
+
+		pci_warn(dev, "   [%2d] %-22s\n", i, errmsg);
+	}
+}
+

My approach was more heavy-handed :)

To confirm that I understand the flow -- when we're processing
aer_err_info, that potentially carries a couple of errors, and we hit a
ratelimit, we mask the error bits in Error Status Register and print
a warning. After doing so, we won't see these types of errors reported
again. What if some time passes (let's say, 2 mins), and we hit a
condition that would normally generate an error but it's now masked? Are
we fine with missing it? I think we should be informed about
Uncorrectable errors as much as possible, as they indicate Link
integrity issues.

-	if (!__ratelimit(ratelimit))
+	if (!__ratelimit(irq_ratelimit)) {
+		mask_reported_error(dev, info);
+		return;
+	}
+	if (!__ratelimit(log_ratelimit))
  		return;

Nit: add a line between these two ifs

All the best,
Karolina