On Wed, Apr 23, 2014 at 11:41 AM, Jan Kara <jack@xxxxxxx> wrote: > > Now I'm not sure how to fix Linus' patches. For all I care we could just > rip out pte dirty bit handling for file mappings. However last time I > suggested this you corrected me that tmpfs & ramfs need this. I assume this > is still the case - however, given we unconditionally mark the page dirty > for write faults, where exactly do we need this? Honza, you're missing the important part: it does not matter one whit that we unconditionally mark the page dirty, when we do it *early*, and it can be then be marked clean before it's actually clean! The problem is that page cleaning can clean the page when there are still writers dirtying the page. Page table tear-down removes the entry from the page tables, but it's still there in the TLB on other CPU's. So other CPU's are possibly writing to the page, when clear_page_dirty_for_io() has marked it clean (because it didn't see the page table entries that got torn down, and it hasn't seen the dirty bit in the page yet). I'm including Dave Hansen's "racewrite.c" with his commentary: "This is a will-it-scale test-case which handles all the thread creation and CPU binding for me: https://github.com/antonblanchard/will-it-scale . Just stick the test case in tests/. I also loopback-mounted a ramfs file as an ext4 filesystem on /mnt to make sure the writeback could happen fast. This reproduces the bug pretty darn quickly and with as few as 4 threads running like this: ./racewrite_threads -t 4 -s 999" and "It reproduces in about 5 seconds on my 4770 on an unpatched kernel. It also reproduces on a _normal_ filesystem and doesn't apparently need the loopback-mounted ext4 ramfs file that I was trying before." so this can actually be triggered. Linus
#define _GNU_SOURCE #define _XOPEN_SOURCE 500 #include <sched.h> #include <sys/mman.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <assert.h> #define BUFLEN 4096 static char wistmpfile[] = "/mnt/willitscale.XXXXXX"; char *testcase_description = "Same file pwrite"; char *buf; #define FILE_SIZE (4096*1024) void testcase_prepare(void) { int fd = mkstemp(wistmpfile); assert(fd >= 0); assert(pwrite(fd, "X", 1, FILE_SIZE-1) == 1); buf = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); assert(buf != (void *)-1); close(fd); } void testcase(unsigned long long *iterations) { int cpu = sched_getcpu(); int fd = open(wistmpfile, O_RDWR); off_t offset = sched_getcpu() * BUFLEN; long counter = 0; long counterread = 0; long *counterbuf = (void *)&buf[offset]; printf("offset: %ld\n", offset); printf(" buf: %p\n", buf); printf("counterbuf: %p\n", counterbuf); assert(fd >= 0); while (1) { int ret; if (cpu == 1) { ret = madvise(buf, FILE_SIZE, MADV_DONTNEED); continue; } *counterbuf = counter; posix_fadvise(fd, offset, BUFLEN, POSIX_FADV_DONTNEED); ret = pread(fd, &counterread, sizeof(counterread), offset); assert(ret == sizeof(counterread)); if (counterread != counter) { printf("cpu: %d\n", cpu); printf(" counter %ld\n", counter); printf("counterread %ld\n", counterread); printf("*counterbuf %ld\n", *counterbuf); while(1); } counter++; (*iterations)++; } } void testcase_cleanup(void) { unlink(wistmpfile); }