Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs

On 20/03/2025 09:20, Jon Pan-Doh wrote:
Spammy devices can flood kernel logs with AER errors and slow/stall execution. Add per-device ratelimits for AER correctable and uncorrectable errors that use the kernel defaults (10 per 5s).

Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show true count of 11.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
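
The arithmetic of the test above (11 errors sent, 10 logged, stats showing 11) can be modelled in userspace. This is only a minimal sketch, assuming a simplified fixed-window limiter that approximates the kernel's interval/burst ratelimit and assuming one ratelimit check per error. The names rl_model, rl_model_allow() and aer_sim_inject() are hypothetical and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; NOT the kernel's struct ratelimit_state. */
struct rl_model {
	uint64_t interval_ms;	/* window length (kernel default: 5 s) */
	unsigned int burst;	/* messages allowed per window (default: 10) */
	uint64_t begin_ms;	/* start of the current window */
	unsigned int printed;	/* messages allowed so far in this window */
	bool started;		/* false until the first check */
};

static void rl_model_init(struct rl_model *rl, uint64_t interval_ms,
			  unsigned int burst)
{
	rl->interval_ms = interval_ms;
	rl->burst = burst;
	rl->begin_ms = 0;
	rl->printed = 0;
	rl->started = false;
}

/* Returns true if a message may be logged at time now_ms. */
static bool rl_model_allow(struct rl_model *rl, uint64_t now_ms)
{
	/* Open a new window on first use or once the interval has elapsed. */
	if (!rl->started || now_ms - rl->begin_ms >= rl->interval_ms) {
		rl->begin_ms = now_ms;
		rl->printed = 0;
		rl->started = true;
	}
	if (rl->printed >= rl->burst)
		return false;
	rl->printed++;
	return true;
}

struct aer_sim_result {
	unsigned int counted;	/* what the stats counter would show */
	unsigned int logged;	/* what would reach the kernel log */
};

/*
 * Inject n errors, spacing_ms apart. Every error is counted, but only
 * those that pass the ratelimit are logged (one check per error is an
 * assumption of this sketch).
 */
static struct aer_sim_result aer_sim_inject(unsigned int n, unsigned int burst,
					    uint64_t spacing_ms)
{
	struct aer_sim_result res = { 0, 0 };
	struct rl_model rl;

	rl_model_init(&rl, 5000, burst);
	for (unsigned int i = 0; i < n; i++) {
		res.counted++;
		if (rl_model_allow(&rl, (uint64_t)i * spacing_ms))
			res.logged++;
	}
	return res;
}
```

Under these assumptions, aer_sim_inject(11, 10, 1) counts 11 errors and logs 10, matching the observation above, while errors spaced 1 s apart are never suppressed because no 5 s window sees more than 10 of them.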
Signed-off-by: Jon Pan-Doh <pandoh@xxxxxxxxxx>
Reviewed-by: Karolina Stolarek <karolina.stolarek@xxxxxxxxxx>

For future reference -- please drop Reviewed-by tags from patches that have functional or larger changes. New code invalidates previous reviews.

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3069376b3553..081cef5fc678 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -88,6 +89,10 @@ struct aer_report {
  	u64 rootport_total_cor_errs;
  	u64 rootport_total_fatal_errs;
  	u64 rootport_total_nonfatal_errs;
+
+	/* Ratelimits for errors */
+	struct ratelimit_state cor_log_ratelimit;
+	struct ratelimit_state uncor_log_ratelimit;
  };
#define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
@@ -379,6 +384,15 @@ void pci_aer_init(struct pci_dev *dev)
  	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
+	/*
+	 * Ratelimits are doubled as a given error produces 2 logs (root port
+	 * and endpoint) that should be under same ratelimit.
+	 */

It took me a bit to understand what this comment is about.

When we handle an error message, we first use the source's ratelimit to decide whether to print the port info, and then again to decide whether to print the actual error. In theory, one message could carry more errors of the same class coming from other devices. For those devices, we would check the ratelimit just once. I don't have a nice and clean solution for this problem; I just wanted to highlight that 1) we don't use the Root Port's ratelimit in aer_print_port_info(), and 2) the burst may be spent on either port_info + error message or just the error message, in different combinations. I think we should reword this comment to make clear that we don't check the ratelimit exactly once per error; we may check it twice.
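
To make the "different combinations" point concrete, here is a rough userspace sketch of how one device's doubled burst could be spent within a single window. It is a simplification under stated assumptions: time is ignored, the device's budget is checked twice (port info, then error) when it is the reported source and once when it is only an additional device in a message, and all single checks happen first. All names (burst_budget, budget_take(), simulate_window()) are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>

/* Hypothetical model of one device's burst within a single window. */
struct burst_budget {
	unsigned int burst;	/* e.g. DEFAULT_RATELIMIT_BURST * 2 = 20 */
	unsigned int used;
};

/* Consume one slot of the burst; false means "suppress this message". */
static bool budget_take(struct burst_budget *b)
{
	if (b->used >= b->burst)
		return false;
	b->used++;
	return true;
}

struct log_counts {
	unsigned int port_info;		/* port info lines printed */
	unsigned int error;		/* error lines printed */
	unsigned int orphan_port_info;	/* port info printed, error suppressed */
};

/*
 * n_single: errors where the device is checked once (error line only).
 * n_source: errors where the device is the source and is checked twice
 *           (port info, then error line).
 */
static struct log_counts simulate_window(unsigned int n_single,
					 unsigned int n_source,
					 unsigned int burst)
{
	struct burst_budget b = { burst, 0 };
	struct log_counts c = { 0, 0, 0 };

	for (unsigned int i = 0; i < n_single; i++) {
		if (budget_take(&b))
			c.error++;
	}

	for (unsigned int i = 0; i < n_source; i++) {
		bool port_ok = budget_take(&b);
		bool err_ok = budget_take(&b);

		if (port_ok)
			c.port_info++;
		if (err_ok)
			c.error++;
		if (port_ok && !err_ok)
			c.orphan_port_info++;
	}
	return c;
}
```

In this model, simulate_window(0, 11, 20) prints 10 port info lines and 10 error lines, which is the "10 per 5s" behaviour the doubling aims for. A single one-check error arriving first, simulate_window(1, 11, 20), shifts the pairing so that one port info line is printed without its error line.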

Also, I wonder -- do only Endpoints generate error messages? From what I understand, some errors can be detected by intermediary devices.

+	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
+			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
+	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
+			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
+
  	/*
  	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
  	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
@@ -668,6 +682,17 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
  	}
  }
+static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)

I really like this solution; it's nice and tidy.


  static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
  {

I'm also thinking -- we are ratelimiting the aer_print_port_info() and aer_print_error(). What about the messages in dpc_process_error()? Should we check early if DPC was triggered because of an uncorrectable error, and if so, ratelimit that?

All the best,
Karolina



