On Sun, 16 May 2021 15:39:56 +0800 Hu Weiwen wrote:
>On Sun, May 16, 2021 at 10:50:35AM +0800, Hillf Danton wrote:
>> On Sun, 16 May 2021 01:01:04 +0800 Hu Weiwen wrote:
>> > Hi all,
>> >
>> > We observed our program sometimes running over 1000x slower than normal
>> > and cannot find the root cause. I would be grateful if someone could give
>> > me a hint. I also searched this mailing list but failed to find anything
>> > related.
>> >
>> > The affected process spends almost 100% of its CPU time in the kernel.
>> > First I ran `perf' to identify where the time is spent. It turns out that
>> > 99.93% of the time is spent in do_page_fault, and 98.16% in migrate_pages.
>> > `perf record -e faults -t <tid>` shows 3 seconds per fault, which seems
>> > too much to me.
>> >
>> > Then I ran `perf record -e compaction:*` and saw 12k
>> > compaction:mm_compaction_migratepages events per second. Almost all of
>> > them have nr_migrated=0 and nr_failed greater than 0, so it seems a lot
>> > of migration failures are happening. But I cannot understand what causes
>> > them. The more complete perf output is attached below.
>> >
>> > This system is used to run machine-learning training programs, with 2
>> > Intel Xeon CPUs, 512GB RAM and 8 NVIDIA GPUs attached, running Ubuntu
>> > 20.04 with kernel version 5.4.0-72-generic.
>>
>> Given the do_huge_pmd_anonymous_page in your perf report below, what you
>> need is likely commit facdaa917c4d ("mm: proactive compaction").
>>
>> You need to port it back to the linux-5.4 powering your box if there is
>> anything that prevents you from using the mainline kernel. Feel free to
>> report again if it does not work for you.
>
>Thank you for your reply. I will try it, but it is hard to say whether it
>works: we can only reproduce this about once per month on 10+ similar boxes.
>
>But I think things still do not add up.
>
>The last time I saw this, I left it running in this state for 6 hours, and it
>still did not get better. If it were just fragmented memory causing too much
>compaction on huge page allocation, I'd expect it to resolve itself in
>minutes.

Sounds good, given "367GB of RAM is available when we observe the problem"
(free -g illustrates the available RAM).

>The commit message of facdaa917c4d reports 700GB of memory being allocated in
>27 minutes without that patch, but we hardly made any progress in 6 hours.
>
>Also, the commit message of facdaa917c4d reports at most 33ms of allocation
>latency without that patch, but we see 3 seconds per page fault.
>
>Over 99% of the pages passed to `migrate_pages' fail to migrate. This is
>concerning and might be related to our issue. I plan to investigate this
>further the next time I see the issue. Any tips on how to find out why the
>migration failed are appreciated.

No idea of such tips, if they exist, beyond reading migrate.c. If d479960e44f2
("mm: disable LRU pagevec during the migration temporarily") is not a cure,
then try to see which commit works with the GPU module unloaded.

>
>> > Load average is 13, and 367GB of RAM is available when we observe the
>> > problem this time. We have observed the very same problem on multiple
>> > nodes, running different programs, with different models of CPU and GPU.
>> >
>> > Currently, only rebooting the system resolves this. When it happens,
>> > every process on the system is affected, and the problem does not go away
>> > if I restart the affected process.
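
Untested here, but next time it happens, before rebooting it may be worth
capturing what the migrate path itself reports; a rough sketch (the
migrate:mm_migrate_pages tracepoint and the vmstat counters below should both
exist on 5.4):

    # per migrate_pages() call: how many pages succeeded/failed, the migrate
    # mode, and the reason the migration was requested (compaction, in your case)
    sudo perf record -e migrate:mm_migrate_pages -a -- sleep 10
    sudo perf report

    # global counters; diff two snapshots taken a few seconds apart
    grep -E 'pgmigrate_|compact_|thp_fault' /proc/vmstat
    sleep 10
    grep -E 'pgmigrate_|compact_|thp_fault' /proc/vmstat

If nearly every page fails in MIGRATE_SYNC_LIGHT mode as well as in
MIGRATE_ASYNC, long-lived page pins (e.g. get_user_pages from the GPU driver)
are a more likely suspect than plain fragmentation, which is another reason to
retest with the GPU module unloaded.
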
>> >
>> > Thanks,
>> > Hu Weiwen
>> > SCUT student
>> >
>> >
>> > `perf report` output, collected with `sudo perf record -F 99 --pid 2413158 -g -- sleep 10`:
>> >
>> > Samples: 993  of event 'cycles', Event count (approx.): 36810932529
>> >   Children      Self  Command  Shared Object       Symbol
>> > -   99.93%     0.00%  python   libc-2.31.so        [.] 0x00007ff5fcc78984
>> >      0x7ff5fcc78984
>> >      page_fault
>> >      do_page_fault
>> >      __do_page_fault
>> >      do_user_addr_fault
>> >      handle_mm_fault
>> >      __handle_mm_fault
>> >      do_huge_pmd_anonymous_page
>> >      alloc_pages_vma
>> >      __alloc_pages_nodemask
>> >      __alloc_pages_slowpath
>> >      __alloc_pages_direct_compact
>> >      try_to_compact_pages
>> >      compact_zone_order
>> >    - compact_zone
>> >       - 98.16% migrate_pages
>> >          - 93.30% try_to_unmap
>> >             - rmap_walk
>> >                - 92.12% rmap_walk_file
>> >                   - 90.45% try_to_unmap_one
>> >                      - 83.48% ptep_clear_flush
>> >                         - 76.33% flush_tlb_mm_range
>> >                            - 74.35% native_flush_tlb_others
>> >                               - 73.75% on_each_cpu_cond_mask
>> >                                  - 70.85% on_each_cpu_mask
>> >                                     - 69.57% smp_call_function_many
>> >                                        - 67.20% smp_call_function_single
>> >                                             4.93% llist_add_batch
>> >                                        + 1.09% generic_exec_single
>> >                                       1.19% smp_call_function_single
>> >                            + 2.01% zalloc_cpumask_var
>> >                        1.45% page_vma_mapped_walk
>> >                        0.78% down_read
>> >                        0.79% up_read
>> >            + 3.29% remove_migration_ptes
>> >              0.59% move_to_new_page
>> >
>> >
>> > `perf report` output, collected with `sudo perf record -e compaction:* -a -- sleep 10`:
>> >
>> > Samples: 133K of event 'compaction:mm_compaction_migratepages', Event count (approx.): 133394
>> > Overhead  Trace output
>> >   26.25%  nr_migrated=0 nr_failed=32
>> >    5.00%  nr_migrated=0 nr_failed=2
>> >    4.64%  nr_migrated=0 nr_failed=4
>> >    4.24%  nr_migrated=0 nr_failed=3
>> >    4.20%  nr_migrated=0 nr_failed=6
>> >    3.83%  nr_migrated=0 nr_failed=1
>> >    3.79%  nr_migrated=0 nr_failed=5
>> >    3.62%  nr_migrated=0 nr_failed=7
>> >    3.45%  nr_migrated=0 nr_failed=8
>> >    3.06%  nr_migrated=0 nr_failed=9
>> >    2.68%  nr_migrated=0 nr_failed=10
>> >    2.68%  nr_migrated=0 nr_failed=11
>> >    2.42%  nr_migrated=0 nr_failed=12
>> >    2.35%  nr_migrated=0 nr_failed=13
>> >    2.11%  nr_migrated=0 nr_failed=14
>> >    1.93%  nr_migrated=0 nr_failed=16
>> >    1.90%  nr_migrated=0 nr_failed=15
>> >    1.64%  nr_migrated=0 nr_failed=17
>> >    1.50%  nr_migrated=0 nr_failed=18
>> >    1.25%  nr_migrated=0 nr_failed=19
>> >    1.25%  nr_migrated=0 nr_failed=20
>> >    1.22%  nr_migrated=0 nr_failed=21
>> >    ...
>> >    0.00%  nr_migrated=0 nr_failed=207
>> >    0.00%  nr_migrated=0 nr_failed=258
>> >    0.00%  nr_migrated=0 nr_failed=273
>> >    0.00%  nr_migrated=0 nr_failed=277
>> >    0.00%  nr_migrated=10 nr_failed=22
>> >    0.00%  nr_migrated=13 nr_failed=19
>> >    0.00%  nr_migrated=2 nr_failed=2
>> >    0.00%  nr_migrated=3 nr_failed=13
>> >    0.00%  nr_migrated=3 nr_failed=17
>> >    0.00%  nr_migrated=3 nr_failed=25
>> >    0.00%  nr_migrated=3 nr_failed=27
>> >    0.00%  nr_migrated=3 nr_failed=28
>> >    0.00%  nr_migrated=4 nr_failed=10
>> >    0.00%  nr_migrated=4 nr_failed=15
>> >    0.00%  nr_migrated=4 nr_failed=24
>> >    0.00%  nr_migrated=5 nr_failed=10
>> >    0.00%  nr_migrated=5 nr_failed=15
>> >    0.00%  nr_migrated=6 nr_failed=20
>> >    0.00%  nr_migrated=6 nr_failed=21
>> >    0.00%  nr_migrated=6 nr_failed=23
>> >    0.00%  nr_migrated=8 nr_failed=19
>
>
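
One more note on facdaa917c4d: if you do backport it, how much background
compaction it performs is tuned through the sysctl that patch introduces
(assuming the backport keeps the upstream name and default):

    # 0 disables proactive compaction, 100 is most aggressive;
    # the upstream default is 20
    cat /proc/sys/vm/compaction_proactiveness
    echo 40 | sudo tee /proc/sys/vm/compaction_proactiveness

That should cut down how often a huge page fault falls into direct compaction,
though given the migration failures above it may not help until their cause is
found.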