CC'd back to LKML. On Thu, Jun 29, 2017 at 06:11:00PM +0530, Kumar Abhishek wrote: > Hi Robin, > > I am an independent developer who stumbled upon your thread on the LKML > after facing a similar issue - my kernel log being spammed by > alloc_contig_range messages. I am running Linux on an ARM system > (specifically the BeagleBoard-X15) and am on kernel version 4.9.33 with TI > patches on top of it. > > I am running Debian Stretch (9.0) on the system. > > Here's what my stack trace looks like: .. > > It's somewhat similar to your stack trace, but this here happens on an > etnaviv GPU (Vivante GCxx). > > In my case if I do 'sudo service lightdm stop', these messages stop too. > This seems to suggest that the problem may be in the X server rather than > the kernel? I seem to think this because I replicated this on an entirely > different set of hardware than yours. > > I just wanted to bring this to your notice, and also ask you if you managed > to solve it for yourself. > > One solution could be to demote the pr_info in alloc_contig_range to > pr_debug or to do away with the message altogether, but this would be > suppressing the issue instead of really knowing what it is about. > > Let me know how I could further investigate this. The problem, as far as I got diagnosed on LKML, is that some of the GPUs have a bunch of non-fatal contiguous memory allocation requests: they have a meaningful fallback path on the allocation, so 'PFNs busy' is a false busy for their case. However, if there was a another consumer that does NOT have a fallback, the output would still be crucially useful. Attached is the patch that I unsuccessfully proposed on LKML to rate-limit the messages, with the last revision to only dump_stack() if CONFIG_CMA_DEBUG was set. The path that LKML wanted was to add a new parameter to suppress or at least demote the failure message, and update all of the callers: but it means that many of the indirect callers need that added parameter as well. mm/cma.c:cma_alloc this call can suppress the error, you can see it retry. mm/hugetlb.c: These callers should get the error message. The error message DOES still have a good general use in notifying you that something is going wrong. There was noticeable performance slowdown in my case when it was trying hard to allocate. -- Robin Hugh Johnson E-Mail : robbat2@xxxxxxxxxxxxxxxxxx Home Page : http://www.orbis-terrarum.net/?l=people.robbat2 ICQ# : 30269588 or 41961639 GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
commit 808c209dc82ce79147122ca78e7047bc74a16149 Author: Robin H. Johnson <robbat2@xxxxxxxxxx> Date: Wed Nov 30 10:32:57 2016 -0800 mm: ratelimit & trace PFNs busy. Signed-off-by: Robin H. Johnson <robbat2@xxxxxxxxxx> Acked-by: Michal Nazarewicz <mina86@xxxxxxxxxx> diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6de9440e3ae2..3c28ec3d18f8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7289,8 +7289,16 @@ int alloc_contig_range(unsigned long start, unsigned long end, /* Make sure the range is really isolated. */ if (test_pages_isolated(outer_start, end, false)) { - pr_info("%s: [%lx, %lx) PFNs busy\n", - __func__, outer_start, end); + static DEFINE_RATELIMIT_STATE(ratelimit_pfn_busy, + DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + if (__ratelimit(&ratelimit_pfn_busy)) { + pr_info("%s: [%lx, %lx) PFNs busy\n", + __func__, outer_start, end); + if (IS_ENABLED(CONFIG_CMA_DEBUG)) + dump_stack(); + } + ret = -EBUSY; goto done; }
Attachment:
signature.asc
Description: Digital signature