----- Original Message -----
>
>
> On 5/7/19 4:51 AM, Jan Stancek wrote:
> > ----- Original Message -----
> >>
> >> On 5/6/19 2:35 PM, Jan Stancek wrote:
> >>> ----- Original Message -----
> >>>> On 5/5/19 7:10 AM, Jan Stancek wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm seeing a userspace program getting stuck on aarch64, on kernels 4.20
> >>>>> and newer. It stalls from seconds to hours.
> >>>>>
> >>>>> I have simplified it to the following scenario (reproducer linked below [1]):
> >>>>>   while (1):
> >>>>>     spawn Thread 1: mmap, write, munmap
> >>>>>     spawn Thread 2: <nothing>
> >>>>>
> >>>>> Thread 1 sporadically gets stuck on the write to the mapped area. User-space
> >>>>> is not moving forward - stdout output stops. Observed CPU usage is,
> >>>>> however, 100%.
> >>>>>
> >>>>> At this time, the kernel appears to be busy handling page faults (~700k per
> >>>>> second):
> >>>>>
> >>>>> # perf top -a -g
> >>>>> -   98.97%  8.30%  a.out  [.] map_write_unmap
> >>>>>    - 23.52% map_write_unmap
> >>>>>       - 24.29% el0_sync
> >>>>>          - 10.42% do_mem_abort
> >>>>>             - 17.81% do_translation_fault
> >>>>>                - 33.01% do_page_fault
> >>>>>                   - 56.18% handle_mm_fault
> >>>>>                        40.26% __handle_mm_fault
> >>>>>                         2.19% __ll_sc___cmpxchg_case_acq_4
> >>>>>                         0.87% mem_cgroup_from_task
> >>>>>                   - 6.18% find_vma
> >>>>>                        5.38% vmacache_find
> >>>>>                     1.35% __ll_sc___cmpxchg_case_acq_8
> >>>>>                     1.23% __ll_sc_atomic64_sub_return_release
> >>>>>                     0.78% down_read_trylock
> >>>>>               0.93% do_translation_fault
> >>>>>    + 8.30% thread_start
> >>>>>
> >>>>> # perf stat -p 8189 -d
> >>>>> ^C
> >>>>>  Performance counter stats for process id '8189':
> >>>>>
> >>>>>         984.311350      task-clock (msec)        #    1.000 CPUs utilized
> >>>>>                  0      context-switches         #    0.000 K/sec
> >>>>>                  0      cpu-migrations           #    0.000 K/sec
> >>>>>            723,641      page-faults              #    0.735 M/sec
> >>>>>      2,559,199,434      cycles                   #    2.600 GHz
> >>>>>        711,933,112      instructions             #    0.28  insn per cycle
> >>>>>    <not supported>      branches
> >>>>>            757,658      branch-misses
> >>>>>        205,840,557      L1-dcache-loads          #  209.121 M/sec
> >>>>>         40,561,529      L1-dcache-load-misses    #   19.71% of all L1-dcache hits
> >>>>>    <not supported>      LLC-loads
> >>>>>    <not supported>      LLC-load-misses
> >>>>>
> >>>>>        0.984454892 seconds time elapsed
> >>>>>
> >>>>> With some extra traces, it appears to be looping in the page fault handler
> >>>>> for the same address, over and over:
> >>>>>   do_page_fault                // mm_flags: 0x55
> >>>>>     __do_page_fault
> >>>>>       __handle_mm_fault
> >>>>>         handle_pte_fault
> >>>>>           ptep_set_access_flags
> >>>>>             if (pte_same(pte, entry)) // pte: e8000805060f53, entry: e8000805060f53
> >>>>>
> >>>>> I had traces in mmap() and munmap() as well; they don't get hit when the
> >>>>> reproducer hits the bad state.
> >>>>>
> >>>>> Notes:
> >>>>> - I'm not able to reproduce this on x86.
> >>>>> - Attaching GDB or strace immediately recovers the application from the stall.
> >>>>> - It also seems to recover faster when the system is busy with other tasks.
> >>>>> - MAP_SHARED vs. MAP_PRIVATE makes no difference.
> >>>>> - Turning off THP makes no difference.
> >>>>> - Reproducer [1] usually hits it within ~1 minute on the HW described below.
> >>>>> - Longman mentioned that "When the rwsem becomes reader-owned, it causes
> >>>>>   all the spinning writers to go to sleep adding wakeup latency to
> >>>>>   the time required to finish the critical sections", but this looks
> >>>>>   like a busy loop, so I'm not sure if it's related to the rwsem issues
> >>>>>   identified in:
> >>>>>   https://lore.kernel.org/lkml/20190428212557.13482-2-longman@xxxxxxxxxx/
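
For anyone who doesn't want to open [1]: the core of the reproducer is
roughly the sketch below. This is my simplification - the map size and the
write pattern are placeholders, and the real code also times each iteration
(that's where the "map_write_unmap took: ... ms" lines further down come
from):

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SIZE (16 * 4096UL)	/* placeholder; see [1] for the real size */

static void *map_write_unmap(void *arg)
{
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	memset(p, 1, MAP_SIZE);	/* the write that sporadically gets stuck */
	munmap(p, MAP_SIZE);
	return NULL;
}

static void *do_nothing(void *arg)
{
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	while (1) {
		pthread_create(&t1, NULL, map_write_unmap, NULL);
		pthread_create(&t2, NULL, do_nothing, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
	}
	return 0;
}

Build with "gcc -O2 -pthread"; only the thread churn plus the
mmap/write/munmap cycle matter for hitting the stall.
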
> >>>> It sounds possible to me. The optimization done by the commit ("mm:
> >>>> mmap: zap pages with read mmap_sem in munmap") is to downgrade the write
> >>>> rwsem to read when zapping pages and page tables in munmap(), after the
> >>>> vmas have been detached from the rbtree.
> >>>>
> >>>> So in your test the mmap(), which is a writer, may steal the lock and
> >>>> execute in parallel with the munmap(), which is a reader after the
> >>>> downgrade, breaking the mutual exclusion.
> >>>>
> >>>> In this case, the parallel mmap() may map to the same area, since the
> >>>> vmas have already been detached by munmap(); mmap() may then create
> >>>> completely identical vmas, and the page fault happens on the same vma
> >>>> at the same address.
> >>>>
> >>>> I'm not sure why gdb or strace could recover this, but they use ptrace,
> >>>> which may acquire mmap_sem and inadvertently break the parallelism.
> >>>>
> >>>> Could you please try Waiman's patch to see if it makes any difference?
> >>>
> >>> I don't see any difference in behaviour after applying:
> >>>   [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count
> >>>   before increment
> >>> The issue is still easily reproducible for me.
> >>>
> >>> I'm including the output of mem_abort_decode() / show_pte() for a sample
> >>> PTE that I see in the page fault loop. (I went through all the bits, but
> >>> couldn't find anything invalid about it.)
> >>>
> >>>   mem_abort_decode: Mem abort info:
> >>>   mem_abort_decode:   ESR = 0x92000047
> >>>   mem_abort_decode:   Exception class = DABT (lower EL), IL = 32 bits
> >>>   mem_abort_decode:   SET = 0, FnV = 0
> >>>   mem_abort_decode:   EA = 0, S1PTW = 0
> >>>   mem_abort_decode: Data abort info:
> >>>   mem_abort_decode:   ISV = 0, ISS = 0x00000047
> >>>   mem_abort_decode:   CM = 0, WnR = 1
> >>>   show_pte: user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000067027567
> >>>   show_pte: [0000ffff6dff0000] pgd=000000176bae0003
> >>>   show_pte: , pud=000000176bae0003
> >>>   show_pte: , pmd=000000174ad60003
> >>>   show_pte: , pte=00e80008023a0f53
> >>>   show_pte: , pte_pfn: 8023a
> >>>
> >>>   >>> print bin(0x47)
> >>>   0b1000111
> >>>
> >>> Per D12-2779 (ARM Architecture Reference Manual), the ISS encoding for
> >>> an exception from a Data Abort:
> >>>   DFSC, bits [5:0], Data Fault Status Code
> >>>   0b000111  Translation fault, level 3
> >>>
> >>> ---
> >>>
> >>> My theory is that the TLB is getting broken.
> >
> > Theory continued:
> >
> > unmap_region() is batching updates to the TLB (for vmas and page tables).
> > And at the same time another thread handles a page fault for the same mm,
> > which increases "tlb_flush_pending".
> >
> > tlb_finish_mmu() called from unmap_region() will thus set 'force = 1'.
> > And arch_tlb_finish_mmu() will in turn reset the TLB range, presumably
> > making it smaller than it would be if force == 0.
> >
> > Change below appears to fix it:
> >
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index f2f03c655807..a4cef21bd62b 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -93,7 +93,7 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
> >  	struct mmu_gather_batch *batch, *next;
> >
> >  	if (force) {
> > -		__tlb_reset_range(tlb);
> >  		__tlb_adjust_range(tlb, start, end - start);
>
> I don't get why the change could fix it?

My guess is that the reset clears "tlb->freed_tables", which changes how
tlb_flush() operates; see "bool last_level = !tlb->freed_tables;" in
arch/arm64/include/asm/tlb.h. Maybe the last-level-only flush doesn't clear
some intermediate entries? No clue.
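
For reference, this is __tlb_reset_range() as I read it in v5.1's
mm/mmu_gather.c (quoting from memory, so double-check the exact source; the
comment is mine). Note that besides resetting start/end, it also clears
"freed_tables" and the "cleared_*" flags:

static void __tlb_reset_range(struct mmu_gather *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0;
	} else {
		tlb->start = TASK_SIZE;
		tlb->end = 0;
	}

	/*
	 * This is what a "force" flush loses: with freed_tables == 0,
	 * arm64's tlb_flush() does a last-level-only TLBI and may leave
	 * stale intermediate (table) entries behind.
	 */
	tlb->freed_tables = 0;
	tlb->cleared_ptes = 0;
	tlb->cleared_pmds = 0;
	tlb->cleared_puds = 0;
	tlb->cleared_p4ds = 0;
}
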
If I let it reset the range, but preserve "freed_tables", it also seems to
solve the problem:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index f2f03c655807..17fb0d7edc03 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -93,8 +93,20 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 	struct mmu_gather_batch *batch, *next;
 
 	if (force) {
-		__tlb_reset_range(tlb);
+		if (tlb->fullmm) {
+			tlb->start = tlb->end = ~0;
+		} else {
+			tlb->start = TASK_SIZE;
+			tlb->end = 0;
+		}
 		__tlb_adjust_range(tlb, start, end - start);

> __tlb_reset_range() just resets start and end to TASK_SIZE and 0, then
> __tlb_adjust_range() sets the proper start and end. I don't get why "force"
> would flush a smaller range?

I'm still trying to understand this part. It's actually not smaller, but it
changes:

  unmap_region()       # vm_start: ffff49bd0000 vm_end: ffff49be0000
    ...                # tlb.start, tlb.end: 1000000000000 0
    free_pgtables()    # tlb.start, tlb.end: ffff40000000 ffff40010000
    tlb_finish_mmu()
      arch_tlb_finish_mmu()  # will see force == 1
                             # resets tlb.start, tlb.end to: ffff49bd0000 ffff49be0000

> > 	}
>
> >>> I made a dummy kernel module that exports a debugfs file, which on read
> >>> triggers:
> >>>     flush_tlb_all();
> >>>
> >>> Any time the reproducer stalls and I read the debugfs file, it recovers
> >>> immediately and resumes printing to stdout.
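
For completeness, a minimal version of that module looks roughly like this
(a sketch, not the exact code I used; the names are arbitrary and error
handling is omitted):

#include <linux/module.h>
#include <linux/debugfs.h>
#include <linux/fs.h>
#include <asm/tlbflush.h>

static struct dentry *tlbdbg_dir;

/* Reading the file is just a trigger; the TLB flush is the side effect. */
static ssize_t tlbdbg_read(struct file *file, char __user *buf,
			   size_t count, loff_t *ppos)
{
	flush_tlb_all();
	return 0;	/* EOF */
}

static const struct file_operations tlbdbg_fops = {
	.owner	= THIS_MODULE,
	.read	= tlbdbg_read,
};

static int __init tlbdbg_init(void)
{
	tlbdbg_dir = debugfs_create_dir("tlbdbg", NULL);
	debugfs_create_file("flush", 0400, tlbdbg_dir, NULL, &tlbdbg_fops);
	return 0;
}

static void __exit tlbdbg_exit(void)
{
	debugfs_remove_recursive(tlbdbg_dir);
}

module_init(tlbdbg_init);
module_exit(tlbdbg_exit);
MODULE_LICENSE("GPL");

A "cat /sys/kernel/debug/tlbdbg/flush" is then enough to unstick the stalled
reproducer.
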
> >> That commit doesn't change anything about the TLB flush, it just moves
> >> the zapping of pages under the read mmap_sem, as MADV_DONTNEED does.
> >>
> >> I don't have an aarch64 board to reproduce and debug it. And, I'm not
> >> familiar with the aarch64 architecture either. But, some history tells me
> >> that parallel zapping of pages may run into a stale TLB with a deferred
> >> flush, meaning that a caller may observe pte_none and fail to flush the
> >> TLB. But, this has been solved by commit 56236a59556c ("mm: refactor TLB
> >> gathering API") and 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush
> >> miss problem").
> >>
> >> For more detail, please refer to commit 4647706ebeee ("mm: always flush
> >> VMA ranges affected by zap_page_range"). Copied Mel and Rik on this
> >> thread. Also added Will Deacon and Catalin Marinas, who are the aarch64
> >> maintainers, to the loop.
> >
> > Thanks
> >
> >> But, your test (triggering a TLB flush) does demonstrate that the TLB
> >> flush is *not* done properly at some point on aarch64, as expected. Could
> >> you please give the below patch a try?
> >
> > Your patch also fixes my reproducer.
>
> Thanks for testing it.
>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index ab650c2..ef41ad5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1336,8 +1336,10 @@ void unmap_vmas(struct mmu_gather *tlb,
> >>
> >>  	mmu_notifier_range_init(&range, vma->vm_mm, start_addr, end_addr);
> >>  	mmu_notifier_invalidate_range_start(&range);
> >> -	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >> +	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> >>  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> >> +		flush_tlb_range(vma, start_addr, end_addr);
> >> +	}
> >>  	mmu_notifier_invalidate_range_end(&range);
> >>  }
> >>
> >>>>> - I tried 2 different aarch64 systems so far: an APM X-Gene CPU Potenza A3
> >>>>>   and a Qualcomm 65-LA-115-151.
> >>>>>   I can reproduce it on both with v5.1-rc7. It's easier to reproduce
> >>>>>   on the latter one (for longer periods of time), which has 46 CPUs.
> >>>>> - Sample output of the reproducer on an otherwise idle system:
> >>>>>   # ./a.out
> >>>>>   [00000314] map_write_unmap took: 26305 ms
> >>>>>   [00000867] map_write_unmap took: 13642 ms
> >>>>>   [00002200] map_write_unmap took: 44237 ms
> >>>>>   [00002851] map_write_unmap took: 992 ms
> >>>>>   [00004725] map_write_unmap took: 542 ms
> >>>>>   [00006443] map_write_unmap took: 5333 ms
> >>>>>   [00006593] map_write_unmap took: 21162 ms
> >>>>>   [00007435] map_write_unmap took: 16982 ms
> >>>>>   [00007488] map_write_unmap took: 13 ms^C
> >>>>>
> >>>>> I ran a bisect, which identified the following commit as the first bad one:
> >>>>>   dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> >>>>>
> >>>>> I can also make the issue go away with the following change:
> >>>>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>>>> index 330f12c17fa1..13ce465740e2 100644
> >>>>> --- a/mm/mmap.c
> >>>>> +++ b/mm/mmap.c
> >>>>> @@ -2844,7 +2844,7 @@ EXPORT_SYMBOL(vm_munmap);
> >>>>>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >>>>>  {
> >>>>>  	profile_munmap(addr);
> >>>>> -	return __vm_munmap(addr, len, true);
> >>>>> +	return __vm_munmap(addr, len, false);
> >>>>>  }
> >>>>>
> >>>>> # cat /proc/cpuinfo | head
> >>>>> processor       : 0
> >>>>> BogoMIPS        : 40.00
> >>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm
> >>>>> CPU implementer : 0x51
> >>>>> CPU architecture: 8
> >>>>> CPU variant     : 0x0
> >>>>> CPU part        : 0xc00
> >>>>> CPU revision    : 1
> >>>>>
> >>>>> # numactl -H
> >>>>> available: 1 nodes (0)
> >>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> >>>>>   24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
> >>>>> node 0 size: 97938 MB
> >>>>> node 0 free: 95732 MB
> >>>>> node distances:
> >>>>> node   0
> >>>>>   0:  10
> >>>>>
> >>>>> Regards,
> >>>>> Jan
> >>>>>
> >>>>> [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap5.c
> >>>>> [2] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/config
> >>
> >