On Wed, 4 Dec 2013 16:07:41 +0000 Mel Gorman <mgorman@xxxxxxx> wrote: > Because I found it impossible to segfault processes under any level of > scanning and numa hinting fault stress after it was applied I think I still managed to trigger the bug, by setting numa page scanning to ludicrous speed, and running two large specjbb2005 processes on a 4 node system in an infinite loop :) I believe the reason is your patch flushes the TLB too late, after the page contents have been migrated over to the new page. The changelog below should explain how the race works, and how this patch supposedly fixes it. If it doesn't, let me know and I'll go back to the drawing board :) ---8<--- Subject: mm,numa: fix memory corrupter race between THP NUMA unmap and migrate There is a subtle race between THP NUMA migration, and the NUMA unmapping code. The NUMA unmapping code does a permission change on pages, which is done with a batched (deferred) TLB flush. This is normally safe, because the pages stay in the same place, and having other CPUs continue to access them until the TLB flush is indistinguishable from having other CPUs do those same accesses before the PTE permission change. The THP NUMA migration code normally does not do a remote TLB flush, because the PTE is marked inaccessible, meaning no other CPUs should have cached TLB entries that allow them to access the memory. However, the following race is possible: CPU A CPU B CPU C load TLB entry make entry PMD_NUMA fault on entry write to page start migrating page write to page change PMD to new page flush TLB reload TLB from new entry lose data The obvious fix is to flush remote TLB entries from the numa migrate code on CPU B, while CPU A is making PTE changes, and has the TLB flush batched up for later. The migration for 4kB pages is currently fine, because it calls mk_ptenonnuma before migrating the page, which causes the migration code to always do a remote TLB flush. We should probably optimize that at some point... Signed-off-by: Rik van Riel <riel@xxxxxxxxxx> --- include/linux/mm_types.h | 3 +++ kernel/sched/core.c | 1 + kernel/sched/fair.c | 4 ++++ mm/huge_memory.c | 10 ++++++++++ 4 files changed, 18 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 261ff4a..fa67ddb 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -427,6 +427,9 @@ struct mm_struct { /* numa_scan_seq prevents two threads setting pte_numa */ int numa_scan_seq; + + /* task_numa_work is unmapping pages, with deferred TLB flush */ + bool numa_tlb_lazy; #endif struct uprobes_state uprobes_state; }; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5f14335..fe80455 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1732,6 +1732,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) if (p->mm && atomic_read(&p->mm->mm_users) == 1) { p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); p->mm->numa_scan_seq = 0; + p->mm->numa_tlb_lazy = false; } if (clone_flags & CLONE_VM) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2ec4afb..c9440f3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1722,7 +1722,11 @@ void task_numa_work(struct callback_head *work) start = max(start, vma->vm_start); end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); end = min(end, vma->vm_end); + wmb(); /* with do_huge_pmd_numa_page */ + mm->numa_tlb_lazy = true; nr_pte_updates += change_prot_numa(vma, start, end); + wmb(); /* with do_huge_pmd_numa_page */ + mm->numa_tlb_lazy = false; /* * Scan sysctl_numa_balancing_scan_size but ensure that diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d68066f..3a03370 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1385,6 +1385,16 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, } /* + * Another CPU is currently turning ptes of this process into + * NUMA ptes. That permission change batches the TLB flush, + * so other CPUs may still have valid TLB entries pointing to + * the current page. Make sure those are flushed before we + * migrate to a new page. + */ + rmb(); /* with task_numa_work */ + if (mm->numa_tlb_lazy) + flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + /* * Migrate the THP to the requested node, returns with page unlocked * and pmd_numa cleared. */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>