[patch 02/21] mm: ratelimit PFNs busy info message

akpm@xxxxxxxxxxxxxxxxxxxx · Thu, 10 Aug 2017 15:23:35 -0700

From: Jonathan Toppins <jtoppins@xxxxxxxxxx>
Subject: mm: ratelimit PFNs busy info message

The RDMA subsystem can generate several thousand of these messages per
second, eventually leading to a kernel crash.  Ratelimit these messages
to prevent the crash.
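
As a rough illustration of what ratelimiting buys here, the following is a
minimal userspace sketch of an interval/burst limiter.  It is not the
kernel's implementation: struct ratelimit, ratelimit_ok() and
count_emitted() are hypothetical names, and time is modelled as plain
integer ticks.  The values used in the note below (a 5-tick window and a
burst of 10) are chosen to mirror the kernel defaults of 5 seconds and 10
messages.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a ratelimit state. */
struct ratelimit {
	long interval;	/* window length, in caller-defined ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed in the current window */
};

/* Returns true if the caller may emit a message at time `now`. */
static bool ratelimit_ok(struct ratelimit *rs, long now)
{
	if (now - rs->begin >= rs->interval) {
		/* Window expired: report what was dropped and start over. */
		if (rs->missed)
			fprintf(stderr, "%d messages suppressed\n", rs->missed);
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}

/* Feed `n` events at `per_tick` events per tick; count how many pass. */
static int count_emitted(struct ratelimit *rs, int n, int per_tick)
{
	int emitted = 0;

	for (int i = 0; i < n; i++)
		if (ratelimit_ok(rs, i / per_tick))
			emitted++;
	return emitted;
}
```

With interval = 5 and burst = 10, feeding this sketch 20000 events at 1000
events per tick lets 40 messages through; the rest are only counted.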

Doug said:

: I've been carrying a version of this for several kernel versions.  I don't
: remember when they started, but we have one (and only one) class of
: machines: Dell PE R730xd, that generate these errors.  When it happens,
: without a rate limit, we get rcu timeouts and kernel oopses.  With the
: rate limit, we just get a lot of annoying kernel messages but the machine
: continues on, recovers, and eventually the memory operations all succeed.

And:

: > Well...  why are all these EBUSY's occurring?  It sounds inefficient
: > (at least) but if it is expected, normal and unavoidable then perhaps we
: > should just remove that message altogether?
: 
: I don't have an answer to that question.  To be honest, I haven't
: looked real hard.  We never had this at all, then it started out of the
: blue, but only on our Dell 730xd machines (and it hits all of them),
: but no other classes or brands of machines.  And we have our 730xd
: machines loaded up with different brands and models of cards (for
: instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
: ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
: meant it wasn't tied to any particular brand/model of RDMA hardware. 
: To me, it always smelled of a hardware oddity specific to maybe the
: CPUs or mainboard chipsets in these machines, so given that I'm not an
: mm expert anyway, I never chased it down.
: 
: A few other relevant details: it showed up somewhere around 4.8/4.9 or
: thereabouts.  It never happened before, but the printk has been there
: since the 3.18 days, so possibly the test to trigger this message was
: changed, or something else in the allocator changed such that the
: situation started happening on these machines?
: 
: And, like I said, it is specific to our 730xd machines (but they are
: all identical, so that could mean it's something like their specific
: ram configuration is causing the allocator to hit this on these machines
: but not on other machines in the cluster, I don't want to say it's
: necessarily the model of chipset or CPU, there are other bits of
: identicalness between these machines).

Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@xxxxxxxxxx
Signed-off-by: Jonathan Toppins <jtoppins@xxxxxxxxxx>
Reviewed-by: Doug Ledford <dledford@xxxxxxxxxx>
Tested-by: Doug Ledford <dledford@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Hillf Danton <hillf.zj@xxxxxxxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN mm/page_alloc.c~mm-ratelimit-pfns-busy-info-message mm/page_alloc.c

--- a/mm/page_alloc.c~mm-ratelimit-pfns-busy-info-message
+++ a/mm/page_alloc.c
@@ -7669,7 +7669,7 @@ int alloc_contig_range(unsigned long sta
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, false)) {
-		pr_info("%s: [%lx, %lx) PFNs busy\n",
+		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
 			__func__, outer_start, end);
 		ret = -EBUSY;
 		goto done;
_