On Sun, 16 May 2021 15:39:56 +0800 Hu Weiwen wrote:
>On Sun, May 16, 2021 at 10:50:35AM +0800, Hillf Danton wrote:
>> On Sun, 16 May 2021 01:01:04 +0800 Hu Weiwen wrote:
>> > Hi all,
>> >
>> > We observed our program sometimes running over 1000x slower than normal
>> > and cannot find the root cause. I would be grateful if someone could give
>> > me a hint. I also searched this mailing list but failed to find anything
>> > related.
>> >
>> > The affected process spends almost 100% of its CPU time in the kernel.
>> > First I ran `perf' to identify where the time is spent. It turns out that
>> > 99.93% of the time is spent in do_page_fault, and 98.16% in migrate_pages.
>> > `perf record -e faults -t <tid>` shows 3 seconds per fault, which seems
>> > too much to me.
>> >
>> > Then I ran `perf record -e compaction:*` and saw 12k
>> > compaction:mm_compaction_migratepages events per second. Almost all of
>> > them have nr_migrated=0 and nr_failed greater than 0, so it seems a lot
>> > of migration failures are happening. But I cannot understand what causes
>> > them. The more complete perf output is attached below.
>> >
>> > This system is used to run machine-learning training programs, with 2
>> > Intel Xeon CPUs, 512GB RAM and 8 NVIDIA GPUs attached, running Ubuntu
>> > 20.04 with kernel version 5.4.0-72-generic.
>>
>> Given the do_huge_pmd_anonymous_page in your perf report below, what you
>> need is likely commit facdaa917c4d ("mm: proactive compaction").
>>
>> You need to port it back to the linux-5.4 powering your box if there is
>> anything that prevents you from using the mainline kernel. Feel free to
>> report again if it does not work for you.
>
>Thank you for your reply. I will try it, but it is hard to say whether it
>works: we can only reproduce this about once per month on 10+ similar boxes.
>
>But I think things still do not add up.
>
>The last time I saw this, I left it running in this state for 6 hours, and it
>still did not get better. If it were just fragmented memory causing too much
>compaction on huge page allocation, I'd expect it to resolve itself in
>minutes.

Sounds good, given "367GB of RAM is available when we observe the problem"
(free -g illustrates the available RAM).

>The commit message of facdaa917c4d reports 700GB of memory being allocated in
>27 minutes without that patch, but we hardly made any progress in 6 hours.
>
>Also, the commit message of facdaa917c4d reports at most 33ms of allocation
>latency without that patch, but we see 3 seconds per page fault.
>
>Over 99% of the pages passed to `migrate_pages' fail to migrate. This is
>concerning and might be related to our issue. I plan to investigate this
>further the next time I see the issue. Any tips on how to find out why the
>migration failed are appreciated.

No idea of such tips, if they exist, beyond reading migrate.c. If d479960e44f2
("mm: disable LRU pagevec during the migration temporarily") is not a cure,
then try to see which commit works with the GPU module unloaded.

>
>> > Load average is 13, and 367GB of RAM is available when we observe the
>> > problem this time. We have observed the very same problem on multiple
>> > nodes, running different programs, with different models of CPU and GPU.
>> >
>> > Currently, only rebooting the system resolves this. When it happens,
>> > every process on the system is affected, and the problem does not go away
>> > if I restart the affected process.
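
Untested here, but next time it happens, before rebooting it may be worth
capturing what the migrate path itself reports; a rough sketch (the
migrate:mm_migrate_pages tracepoint and the vmstat counters below should both
exist on 5.4):

    # per migrate_pages() call: how many pages succeeded/failed, the migrate
    # mode, and the reason the migration was requested (compaction, in your case)
    sudo perf record -e migrate:mm_migrate_pages -a -- sleep 10
    sudo perf report

    # global counters; diff two snapshots taken a few seconds apart
    grep -E 'pgmigrate_|compact_|thp_fault' /proc/vmstat
    sleep 10
    grep -E 'pgmigrate_|compact_|thp_fault' /proc/vmstat

If nearly every page fails in MIGRATE_SYNC_LIGHT mode as well as in
MIGRATE_ASYNC, long-lived page pins (e.g. get_user_pages from the GPU driver)
are a more likely suspect than plain fragmentation, which is another reason to
retest with the GPU module unloaded.
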
>> >
>> > Thanks,
>> > Hu Weiwen
>> > SCUT student
>> >
>> >
>> > `perf report` output, collected with `sudo perf record -F 99 --pid 2413158 -g -- sleep 10`:
>> >
>> > Samples: 993  of event 'cycles', Event count (approx.): 36810932529
>> >   Children      Self  Command  Shared Object       Symbol
>> > -   99.93%     0.00%  python   libc-2.31.so        [.] 0x00007ff5fcc78984
>> >      0x7ff5fcc78984
>> >      page_fault
>> >      do_page_fault
>> >      __do_page_fault
>> >      do_user_addr_fault
>> >      handle_mm_fault
>> >      __handle_mm_fault
>> >      do_huge_pmd_anonymous_page
>> >      alloc_pages_vma
>> >      __alloc_pages_nodemask
>> >      __alloc_pages_slowpath
>> >      __alloc_pages_direct_compact
>> >      try_to_compact_pages
>> >      compact_zone_order
>> >    - compact_zone
>> >       - 98.16% migrate_pages
>> >          - 93.30% try_to_unmap
>> >             - rmap_walk
>> >                - 92.12% rmap_walk_file
>> >                   - 90.45% try_to_unmap_one
>> >                      - 83.48% ptep_clear_flush
>> >                         - 76.33% flush_tlb_mm_range
>> >                            - 74.35% native_flush_tlb_others
>> >                               - 73.75% on_each_cpu_cond_mask
>> >                                  - 70.85% on_each_cpu_mask
>> >                                     - 69.57% smp_call_function_many
>> >                                        - 67.20% smp_call_function_single
>> >                                             4.93% llist_add_batch
>> >                                        + 1.09% generic_exec_single
>> >                                       1.19% smp_call_function_single
>> >                            + 2.01% zalloc_cpumask_var
>> >                        1.45% page_vma_mapped_walk
>> >                        0.78% down_read
>> >                        0.79% up_read
>> >            + 3.29% remove_migration_ptes
>> >              0.59% move_to_new_page
>> >
>> >
>> > `perf report` output, collected with `sudo perf record -e compaction:* -a -- sleep 10`:
>> >
>> > Samples: 133K of event 'compaction:mm_compaction_migratepages', Event count (approx.): 133394
>> > Overhead  Trace output
>> >   26.25%  nr_migrated=0 nr_failed=32
>> >    5.00%  nr_migrated=0 nr_failed=2
>> >    4.64%  nr_migrated=0 nr_failed=4
>> >    4.24%  nr_migrated=0 nr_failed=3
>> >    4.20%  nr_migrated=0 nr_failed=6
>> >    3.83%  nr_migrated=0 nr_failed=1
>> >    3.79%  nr_migrated=0 nr_failed=5
>> >    3.62%  nr_migrated=0 nr_failed=7
>> >    3.45%  nr_migrated=0 nr_failed=8
>> >    3.06%  nr_migrated=0 nr_failed=9
>> >    2.68%  nr_migrated=0 nr_failed=10
>> >    2.68%  nr_migrated=0 nr_failed=11
>> >    2.42%  nr_migrated=0 nr_failed=12
>> >    2.35%  nr_migrated=0 nr_failed=13
>> >    2.11%  nr_migrated=0 nr_failed=14
>> >    1.93%  nr_migrated=0 nr_failed=16
>> >    1.90%  nr_migrated=0 nr_failed=15
>> >    1.64%  nr_migrated=0 nr_failed=17
>> >    1.50%  nr_migrated=0 nr_failed=18
>> >    1.25%  nr_migrated=0 nr_failed=19
>> >    1.25%  nr_migrated=0 nr_failed=20
>> >    1.22%  nr_migrated=0 nr_failed=21
>> >    ...
>> >    0.00%  nr_migrated=0 nr_failed=207
>> >    0.00%  nr_migrated=0 nr_failed=258
>> >    0.00%  nr_migrated=0 nr_failed=273
>> >    0.00%  nr_migrated=0 nr_failed=277
>> >    0.00%  nr_migrated=10 nr_failed=22
>> >    0.00%  nr_migrated=13 nr_failed=19
>> >    0.00%  nr_migrated=2 nr_failed=2
>> >    0.00%  nr_migrated=3 nr_failed=13
>> >    0.00%  nr_migrated=3 nr_failed=17
>> >    0.00%  nr_migrated=3 nr_failed=25
>> >    0.00%  nr_migrated=3 nr_failed=27
>> >    0.00%  nr_migrated=3 nr_failed=28
>> >    0.00%  nr_migrated=4 nr_failed=10
>> >    0.00%  nr_migrated=4 nr_failed=15
>> >    0.00%  nr_migrated=4 nr_failed=24
>> >    0.00%  nr_migrated=5 nr_failed=10
>> >    0.00%  nr_migrated=5 nr_failed=15
>> >    0.00%  nr_migrated=6 nr_failed=20
>> >    0.00%  nr_migrated=6 nr_failed=21
>> >    0.00%  nr_migrated=6 nr_failed=23
>> >    0.00%  nr_migrated=8 nr_failed=19
>
>
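
One more note on facdaa917c4d: if you do backport it, how much background
compaction it performs is tuned through the sysctl that patch introduces
(assuming the backport keeps the upstream name and default):

    # 0 disables proactive compaction, 100 is most aggressive;
    # the upstream default is 20
    cat /proc/sys/vm/compaction_proactiveness
    echo 40 | sudo tee /proc/sys/vm/compaction_proactiveness

That should cut down how often a huge page fault falls into direct compaction,
though given the migration failures above it may not help until their cause is
found.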