Re: swapper/0: page allocation failure: order:0, mode:0x1204010(GFP_NOWAIT|__GFP_COMP|__GFP_RECLAIMABLE|__GFP_NOTRACK), nodemask=(null)

Michal Hocko <mhocko@xxxxxxxxxx> · Thu, 2 Nov 2017 16:01:20 +0100

On Thu 02-11-17 22:15:06, Tetsuo Handa wrote:
> I was waiting for Michal's comment, but it seems that he is too busy now.
> Thus, I post non-authoritative comment here. (I'm not a tracepoints user.)

yes, that is the case. Thanks for looking into this Tetsuo.

> Two stalls were found in dmesg but only PID = 2798 part was recorded in the trace logs.
> 
>   [ 6109.502115] chrome: page allocation stalls for 10321ms, order:0, mode:0x14000d2(GFP_TEMPORARY|__GFP_HIGHMEM), nodemask=(null)
>   [ 6109.502179] chrome cpuset=/ mems_allowed=0
>   [ 6109.502570] CPU: 0 PID: 2798 Comm: chrome Not tainted 4.13.9-300.fc27.x86_64+debug #1
> 

I have only glanced through the trace data.

> So, trying to analyze this one. 
> 
> Since 10 seconds of blank was found between mm_shrink_slab_start and
> mm_shrink_slab_end, this alone can cause stall warning messages.
> 
>   # tracer: nop
>   #
>   #                              _-----=> irqs-off
>   #                             / _----=> need-resched
>   #                            | / _---=> hardirq/softirq
>   #                            || / _--=> preempt-depth
>   #                            ||| /     delay
>   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
>   #              | |       |   ||||       |         |
>             chrome-2798  [000] .N.1  6099.188540: mm_shrink_slab_start: super_cache_scan+0x0/0x1b0 ffff8eefa4651830: nid: 0 objects to shrink 5895 gfp_flags GFP_TEMPORARY|__GFP_HIGHMEM pgs_scanned 90 lru_pgs 6992959 cache items 5049 delta 0 total_scan 2524
>             chrome-2798  [000] ...1  6109.494205: mm_shrink_slab_end: super_cache_scan+0x0/0x1b0 ffff8eefa4651830: nid: 0 unused scan count 5895 new scan count 941785 total_scan 476 last shrinker return val 1959

Yeah, the direct reclaim has started
chrome-2798  [000] ...1  6099.187991: mm_vmscan_direct_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_TEMPORARY|__GFP_HIGHMEM classzone_idx=2
and finished
chrome-2798  [000] ...1  6109.509445: mm_vmscan_direct_reclaim_end: nr_reclaimed=51

the only notable hole in logging was the one pointed by Tetsuo. There is
a lot of activity on that CPU during that time wrt. reclaim
$ grep -v '\-2798' trace.txt | grep '\[000\]' | awk '{val=$4+0; if (val > 6099 && val < 6109) print}' | wc -l
744

And there were more processes involved
$ grep -v '\-2798' trace.txt | grep '\[000\]' | awk '{val=$4+0; if (val > 6099 && val < 6109) print $1}' | sort | uniq -c
     74 <...>-10654
     43 <...>-13862
     82 <...>-17624
      2 <...>-27318
      1 <...>-3518
     37 <...>-5331
    180 <...>-6602
     38 chrome-3482
     40 Chrome_IOThread-2773
      3 DedicatedWorker-19604
      1 gmain-10668
    139 qemu-system-x86-13763
    104 TaskSchedulerBa-6011

So I agree that it looks like your system seems to be overloaded.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>