+LKML. Also attached the kernel patch that enlarges the race window and the user space test code raceremap.c, which you can put in will-it-scale's tests directory and run it as: # ./raceremap_threads -t 2 -s 1 Make sure "cpu0 runs" appeared in the first line. If you get SEGFAULT: [aaron@aaronlu will-it-scale]$ sudo ./raceremap_threads -t 2 -s 1 cpu0 runs cpu1 runs cpu0: going to remap testcase:mremap warmup cpu1: going to clean the page cpu1: going to write 2 min:0 max:0 total:0 min:0 max:0 total:0 cpu0: remap done Segmentation fault That means the race doesn't occur. If you get "*cpu1 wrote 2 gets lost": [aaron@aaronlu will-it-scale]$ sudo ./raceremap_threads -t 2 -s 1 cpu1 runs testcase:mremap warmup cpu0 runs cpu0: going to remap cpu1: going to clean the page cpu1: going to write 2 cpu1 wrote 2 min:0 max:0 total:0 min:0 max:0 total:0 cpu0: remap done *cpu1 wrote 2 gets lost That means the race occurred. Thanks, Aaron On Thu, Nov 10, 2016 at 05:16:33PM +0800, Aaron Lu wrote: > Prior to 3.15, there was a race between zap_pte_range() and > page_mkclean() where writes to a page could be lost. Dave Hansen > discovered by inspection that there is a similar race between > move_ptes() and page_mkclean(). > > We've been able to reproduce the issue by enlarging the race window with > a msleep(), but have not been able to hit it without modifying the code. > So, we think it's a real issue, but is difficult or impossible to hit > in practice. > > The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split > 'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And > this patch is to fix the race between page_mkclean() and mremap(). > > Here is one possible way to hit the race: suppose a process mmapped a > file with READ | WRITE and SHARED, it has two threads and they are bound > to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread 1 > did a write to addr X so that CPU1 now has a writable TLB for addr X > on it. Thread 2 starts mremaping from addr X to Y while thread 1 cleaned > the page and then did another write to the old addr X again. The 2nd > write from thread 1 could succeed but the value will get lost. > > thread 1 thread 2 > (bound to CPU1) (bound to CPU2) > > 1: write 1 to addr X to get a > writeable TLB on this CPU > > 2: mremap starts > > 3: move_ptes emptied PTE for addr X > and setup new PTE for addr Y and > then dropped PTL for X and Y > > 4: page laundering for N by doing > fadvise FADV_DONTNEED. When done, > pageframe N is deemed clean. > > 5: *write 2 to addr X > > 6: tlb flush for addr X > > 7: munmap (Y, pagesize) to make the > page unmapped > > 8: fadvise with FADV_DONTNEED again > to kick the page off the pagecache > > 9: pread the page from file to verify > the value. If 1 is there, it means > we have lost the written 2. > > *the write may or may not cause segmentation fault, it depends on > if the TLB is still on the CPU. > > Please note that this is only one specific way of how the race could > occur, it didn't mean that the race could only occur in exact the above > config, e.g. more than 2 threads could be involved and fadvise() could > be done in another thread, etc. > > For anonymous pages, they could race between mremap() and page reclaim: > THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge PMD > gets unmapped/splitted/pagedout before the flush tlb happened for the old > huge PMD in move_page_tables() and we could still write data to it. > The normal anonymous page has similar situation. > > To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and > if any, did the flush before dropping the PTL. If we did the flush for > every move_ptes()/move_huge_pmd() call then we do not need to do the > flush in move_pages_tables() for the whole range. But if we didn't, we > still need to do the whole range flush. > > Alternatively, we can track which part of the range is flushed in > move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole > range in move_page_tables(). But that would require multiple tlb flushes > for the different sub-ranges and should be less efficient than the > single whole range flush. > > KBuild test on my Sandybridge desktop doesn't show any noticeable change. > v4.9-rc4: > real 5m14.048s > user 32m19.800s > sys 4m50.320s > > With this commit: > real 5m13.888s > user 32m19.330s > sys 4m51.200s > > Reported-by: Dave Hansen <dave.hansen@xxxxxxxxx> > Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxx> > --- > include/linux/huge_mm.h | 2 +- > mm/huge_memory.c | 9 ++++++++- > mm/mremap.c | 30 +++++++++++++++++++++--------- > 3 files changed, 30 insertions(+), 11 deletions(-) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 9b9f65d99873..e35e6de633b9 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -22,7 +22,7 @@ extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > unsigned char *vec); > extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, > unsigned long new_addr, unsigned long old_end, > - pmd_t *old_pmd, pmd_t *new_pmd); > + pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush); > extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > unsigned long addr, pgprot_t newprot, > int prot_numa); > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index cdcd25cb30fe..eff3de359d50 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1426,11 +1426,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, > > bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, > unsigned long new_addr, unsigned long old_end, > - pmd_t *old_pmd, pmd_t *new_pmd) > + pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush) > { > spinlock_t *old_ptl, *new_ptl; > pmd_t pmd; > struct mm_struct *mm = vma->vm_mm; > + bool force_flush = false; > > if ((old_addr & ~HPAGE_PMD_MASK) || > (new_addr & ~HPAGE_PMD_MASK) || > @@ -1455,6 +1456,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, > new_ptl = pmd_lockptr(mm, new_pmd); > if (new_ptl != old_ptl) > spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); > + if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd)) > + force_flush = true; > pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd); > VM_BUG_ON(!pmd_none(*new_pmd)); > > @@ -1467,6 +1470,10 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, > set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd)); > if (new_ptl != old_ptl) > spin_unlock(new_ptl); > + if (force_flush) > + flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE); > + else > + *need_flush = true; > spin_unlock(old_ptl); > return true; > } > diff --git a/mm/mremap.c b/mm/mremap.c > index da22ad2a5678..6ccecc03f56a 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -104,11 +104,13 @@ static pte_t move_soft_dirty_pte(pte_t pte) > static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, > unsigned long old_addr, unsigned long old_end, > struct vm_area_struct *new_vma, pmd_t *new_pmd, > - unsigned long new_addr, bool need_rmap_locks) > + unsigned long new_addr, bool need_rmap_locks, bool *need_flush) > { > struct mm_struct *mm = vma->vm_mm; > pte_t *old_pte, *new_pte, pte; > spinlock_t *old_ptl, *new_ptl; > + bool force_flush = false; > + unsigned long len = old_end - old_addr; > > /* > * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma > @@ -146,6 +148,14 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, > new_pte++, new_addr += PAGE_SIZE) { > if (pte_none(*old_pte)) > continue; > + > + /* > + * We are remapping a dirty PTE, make sure to > + * flush TLB before we drop the PTL for the > + * old PTE or we may race with page_mkclean(). > + */ > + if (pte_present(*old_pte) && pte_dirty(*old_pte)) > + force_flush = true; > pte = ptep_get_and_clear(mm, old_addr, old_pte); > pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr); > pte = move_soft_dirty_pte(pte); > @@ -156,6 +166,10 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, > if (new_ptl != old_ptl) > spin_unlock(new_ptl); > pte_unmap(new_pte - 1); > + if (force_flush) > + flush_tlb_range(vma, old_end - len, old_end); > + else > + *need_flush = true; > pte_unmap_unlock(old_pte - 1, old_ptl); > if (need_rmap_locks) > drop_rmap_locks(vma); > @@ -201,13 +215,12 @@ unsigned long move_page_tables(struct vm_area_struct *vma, > if (need_rmap_locks) > take_rmap_locks(vma); > moved = move_huge_pmd(vma, old_addr, new_addr, > - old_end, old_pmd, new_pmd); > + old_end, old_pmd, new_pmd, > + &need_flush); > if (need_rmap_locks) > drop_rmap_locks(vma); > - if (moved) { > - need_flush = true; > + if (moved) > continue; > - } > } > split_huge_pmd(vma, old_pmd, old_addr); > if (pmd_trans_unstable(old_pmd)) > @@ -220,11 +233,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma, > extent = next - new_addr; > if (extent > LATENCY_LIMIT) > extent = LATENCY_LIMIT; > - move_ptes(vma, old_pmd, old_addr, old_addr + extent, > - new_vma, new_pmd, new_addr, need_rmap_locks); > - need_flush = true; > + move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, > + new_pmd, new_addr, need_rmap_locks, &need_flush); > } > - if (likely(need_flush)) > + if (need_flush) > flush_tlb_range(vma, old_end-len, old_addr); > > mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); > -- > 2.5.5 >
>From c529dfa6bdfc643a9c3debb4b61b9b0c13b0862e Mon Sep 17 00:00:00 2001 From: Aaron Lu <aaron.lu@xxxxxxxxx> Date: Thu, 17 Nov 2016 15:11:08 +0800 Subject: [PATCH] mremap: add a 2s delay for MAP_FIXED case Add a 2s delay for MAP_FIXED case to enlarge the race window so that we can hit the race in user space. Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxx> --- fs/exec.c | 2 +- include/linux/mm.h | 2 +- mm/mremap.c | 19 ++++++++++++------- 3 files changed, 14 insertions(+), 9 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 4e497b9ee71e..1e49ce9a23bd 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -619,7 +619,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) * process cleanup to remove whatever mess we made. */ if (length != move_page_tables(vma, old_start, - vma, new_start, length, false)) + vma, new_start, length, false, false)) return -ENOMEM; lru_add_drain(); diff --git a/include/linux/mm.h b/include/linux/mm.h index a92c8d73aeaf..5e35fe3d914a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1392,7 +1392,7 @@ int vma_is_stack_for_current(struct vm_area_struct *vma); extern unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, - bool need_rmap_locks); + bool need_rmap_locks, bool delay); extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot, int dirty_accountable, int prot_numa); diff --git a/mm/mremap.c b/mm/mremap.c index da22ad2a5678..8e35279ca622 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -22,6 +22,7 @@ #include <linux/mmu_notifier.h> #include <linux/uaccess.h> #include <linux/mm-arch-hooks.h> +#include <linux/delay.h> #include <asm/cacheflush.h> #include <asm/tlbflush.h> @@ -166,7 +167,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, - bool need_rmap_locks) + bool need_rmap_locks, bool delay) { unsigned long extent, next, old_end; pmd_t *old_pmd, *new_pmd; @@ -224,8 +225,11 @@ unsigned long move_page_tables(struct vm_area_struct *vma, new_vma, new_pmd, new_addr, need_rmap_locks); need_flush = true; } - if (likely(need_flush)) + if (likely(need_flush)) { + if (delay) + msleep(2000); flush_tlb_range(vma, old_end-len, old_addr); + } mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end); @@ -234,7 +238,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma, static unsigned long move_vma(struct vm_area_struct *vma, unsigned long old_addr, unsigned long old_len, - unsigned long new_len, unsigned long new_addr, bool *locked) + unsigned long new_len, unsigned long new_addr, + bool *locked, bool delay) { struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *new_vma; @@ -273,7 +278,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, return -ENOMEM; moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len, - need_rmap_locks); + need_rmap_locks, delay); if (moved_len < old_len) { err = -ENOMEM; } else if (vma->vm_ops && vma->vm_ops->mremap) { @@ -287,7 +292,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, * and then proceed to unmap new area instead of old. */ move_page_tables(new_vma, new_addr, vma, old_addr, moved_len, - true); + true, delay); vma = new_vma; old_len = new_len; old_addr = new_addr; @@ -442,7 +447,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len, if (offset_in_page(ret)) goto out1; - ret = move_vma(vma, addr, old_len, new_len, new_addr, locked); + ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, true); if (!(offset_in_page(ret))) goto out; out1: @@ -576,7 +581,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, goto out; } - ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked); + ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked, false); } out: if (offset_in_page(ret)) { -- 2.5.5
#define _GNU_SOURCE #define _XOPEN_SOURCE 500 #include <sched.h> #include <sys/mman.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <assert.h> #include <sys/io.h> #define BUFLEN 4096 static char wistmpfile[] = "/mnt/willitscale.XXXXXX"; char *testcase_description = "mremap"; char *buf; char *newbuf = (char *)0x700000000000; #define FILE_SIZE (4096*128) static void mdelay(int ms) { int i; // gain io permission for the delay assert(ioperm(0x80, 8, 1) == 0); for (i = 0; i < ms; i++) inb(0x80); } void testcase_prepare(void) { int fd = mkstemp(wistmpfile); assert(fd >= 0); assert(pwrite(fd, "X", 1, FILE_SIZE-1) == 1); buf = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); assert(buf != (void *)-1); close(fd); } static volatile int step = 0; void testcase(unsigned long long *iterations) { int cpu = sched_getcpu(); int fd = open(wistmpfile, O_RDWR); off_t offset = sched_getcpu() * BUFLEN; long counterread = 0; long *counterbuf = (void *)&buf[offset]; assert(fd >= 0); printf("cpu%d runs\n", cpu); while (1) { int ret; if (cpu == 0) { void *tmpbuf; // wait for step 1 done while (step < 1); // step 2: start mremap to have the old PTE emptied printf("cpu%d: going to remap\n", cpu); step = 2; tmpbuf = mremap(buf, FILE_SIZE, FILE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE, newbuf); assert(tmpbuf == newbuf); printf("cpu%d: remap done\n", cpu); pause(); } // step 1: dirty the old PTE *counterbuf = 1; step = 1; while (step < 2); // step 3: clean this page // delay a little while to give mremap some time // to empty the old PTE and setup new PTE mdelay(1000); printf("cpu%d: going to clean the page\n", cpu); posix_fadvise(fd, offset, BUFLEN, POSIX_FADV_DONTNEED); // step 4: now the page is cleaned, its new PTE is // write protected but since mremap didn't flush tlb // for the old PTE yet, we could still access the old // addr and that will not dirty anything printf("cpu%d: going to write 2\n", cpu); *counterbuf = 2; printf("cpu%d wrote 2\n", cpu); // step 5: drop this page from page cache and then // read it back to verify if the last write gets lost // munmap the page first, or the FADV_DONTNEED won't // kick the page out of page cache munmap(newbuf + offset, BUFLEN); posix_fadvise(fd, offset, BUFLEN, POSIX_FADV_DONTNEED); ret = pread(fd, &counterread, sizeof(counterread), offset); assert(ret == sizeof(counterread)); if (counterread != 2) { printf("*cpu%d wrote 2 gets lost\n", cpu); fflush(stdout); } exit(0); } } void testcase_cleanup(void) { unlink(wistmpfile); }