Re: swapper/0: page allocation failure: order:0, mode:0x1204010(GFP_NOWAIT|__GFP_COMP|__GFP_RECLAIMABLE|__GFP_NOTRACK), nodemask=(null)

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Thu, 2 Nov 2017 22:15:06 +0900

I was waiting for Michal's comment, but it seems that he is too busy now.
Thus, I post non-authoritative comment here. (I'm not a tracepoints user.)

On 2017/10/30 6:48, Михаил Гаврилов wrote:
> On 26 October 2017 at 22:49, Михаил Гаврилов
> <mikhail.v.gavrilov@xxxxxxxxx> wrote:
>> On 25 October 2017 at 01:06, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>>>> [ 3551.169126] chrome: page allocation stalls for 11542ms, order:0,
>>>> mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
>>>
>>> this is a sleeping allocation which means that it is allowed to perform
>>> the direct reclaim and that took a lot of time here. This is really
>>> unusual and worth debugging some more.
>>>
>>> [...]
>>>> [ 3551.169590] Mem-Info:
>>>> [ 3551.169595] active_anon:6904352 inactive_anon:520427 isolated_anon:0
>>>>                 active_file:55480 inactive_file:38890 isolated_file:0
>>>>                 unevictable:1836 dirty:556 writeback:0 unstable:0
>>>>                 slab_reclaimable:67559 slab_unreclaimable:95967
>>>>                 mapped:353547 shmem:480723 pagetables:89161 bounce:0
>>>>                 free:49404 free_pcp:1474 free_cma:0
>>>
>>> This tells us that there is quite some page cache (file LRUs) to reclaim
>>> so I am wondering what could have caused such a delay. In order to debug
>>> this some more we would need an additional debugging information. I
>>> usually enable vmscan tracepoints to watch for events during the
>>> reclaim.
>>>
>>
>> I able got the needed tracepoints logs.
>> If I understanded correctly vmscan tracepoints are possible enable by
>> option 1 in the file /sys/kernel/debug/tracing/events/vmscan/enable
>> All archives attached to this email.
>>

Two stalls were found in dmesg but only PID = 2798 part was recorded in the trace logs.

  [ 6109.502115] chrome: page allocation stalls for 10321ms, order:0, mode:0x14000d2(GFP_TEMPORARY|__GFP_HIGHMEM), nodemask=(null)
  [ 6109.502179] chrome cpuset=/ mems_allowed=0
  [ 6109.502570] CPU: 0 PID: 2798 Comm: chrome Not tainted 4.13.9-300.fc27.x86_64+debug #1

So, trying to analyze this one. 

Since 10 seconds of blank was found between mm_shrink_slab_start and
mm_shrink_slab_end, this alone can cause stall warning messages.

  # tracer: nop
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
            chrome-2798  [000] .N.1  6099.188540: mm_shrink_slab_start: super_cache_scan+0x0/0x1b0 ffff8eefa4651830: nid: 0 objects to shrink 5895 gfp_flags GFP_TEMPORARY|__GFP_HIGHMEM pgs_scanned 90 lru_pgs 6992959 cache items 5049 delta 0 total_scan 2524
            chrome-2798  [000] ...1  6109.494205: mm_shrink_slab_end: super_cache_scan+0x0/0x1b0 ffff8eefa4651830: nid: 0 unused scan count 5895 new scan count 941785 total_scan 476 last shrinker return val 1959

Since need-resched flag was set as of mm_shrink_slab_start and was not set
as of mm_shrink_slab_end, and last shrinker return val is larger than 0,
PID = 2798 has called cond_resched() inside "while" loop in do_shrink_slab().

During this blank, CPU 0 recorded many mm_vmscan_writepage: lines
with flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC. What is strange is that
there was 4.7 seconds of blank inside of 10 seconds of blank.

             <...>-13862 [000] .N.1  6102.008806: mm_vmscan_direct_reclaim_end: nr_reclaimed=42
   qemu-system-x86-13763 [000] ...1  6106.732115: mm_shrink_slab_end: super_cache_scan+0x0/0x1b0 ffff8eefa4651830: nid: 0 unused scan count 7 new scan count 1079551 total_scan 0 last shrinker return val 0

I wonder what CPU 0 was doing for this blank period.

> 
> I was able to catch this issue again.
> Is there anything interesting in the trace logs?

Nothing interesting was recorded. What is interesting is that nothing
about PID = 6542 was recorded in the trace logs. It stalled for more
than 10 seconds without ever hitting vmscan tracepoints!?

  [ 8445.912332] CFileWriterThre: page allocation stalls for 12123ms, order:4, mode:0x140c4c0(GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
  [ 8445.912355] CFileWriterThre cpuset=/ mems_allowed=0
  [ 8445.912501] CPU: 3 PID: 6542 Comm: CFileWriterThre Not tainted 4.13.9-300.fc27.x86_64+debug #1

I can't tell whether enabling more tracepoints gives us some clue. But
your system might be merely overloaded. Your system is hosting a lot of
processes including QEMU and Chrome on 8 CPUs + 32GB RAM + 64GB swap and
nearly a half of swap is in use, isn't it?

Anyway, this allocation stall warning mechanism is about to be removed
( http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx ).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>