On Sat, Mar 07, 2020 at 09:33:08PM +0100, David Hildenbrand wrote: > On 20.02.20 16:53, Peter Xu wrote: > > [Resend v6] > > > > This is v6 of the series. It is majorly a rebase to 5.6-rc2, nothing > > else to be expected (plus some tests after the rebase). Instead of > > rewrite the cover letter I decided to use what we have for v5. > > > > Adding extra CCs for both Bobby Powers <bobbypowers@xxxxxxxxx> and > > Brian Geffon <bgeffon@xxxxxxxxxx>. > > > > Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry > > > > Any review comment is appreciated. Thanks, > > If I am not completely missing something (and all my testing today was > wrong) there is a very simple reason why I *LOVE* this series and it made > my weekend. It makes userfaultfd with concurrent discarding (e.g., > MADV_DONTNEED) of pages actually usable. Hi, David, Thanks for doing further test against this branch! > > The issue in current code is that between placing a page and waking > up a waiter, somebody can zap the new placed page and trigger > re-fault, triggering a SIGBUS and crashing an application where all > memory is supposed to be accessible. And there is no real way to protect > from that, because when the fault handler will be woken up and retry > is not deterministic (e.g., making madvise(MADV_DONTNEED) and > UFFDIO_ZEROPAGE mutually exclusive does not help). > > Find a simple reproducer at the end of this mail. > > Before this series: > [root@localhost ~]# ./a.out > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > Progress! > [ 34.849604] FAULT_FLAG_ALLOW_RETRY missing 70 > [ 34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92 > [ 34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4 > [ 34.852818] Call Trace: > [ 34.853045] dump_stack+0x8f/0xd0 > [ 34.853338] handle_userfault.cold+0x1a/0x2e > [ 34.853704] ? find_held_lock+0x2b/0x80 > [ 34.854031] ? __handle_mm_fault+0x18c5/0x1900 > [ 34.854409] __handle_mm_fault+0x18d4/0x1900 > [ 34.854784] handle_mm_fault+0x169/0x360 > [ 34.855120] do_user_addr_fault+0x20d/0x490 > [ 34.855478] async_page_fault+0x43/0x50 > [ 34.855809] RIP: 0033:0x401659 > [ 34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad fa ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8 > [ 34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246 > [ 34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba85784ef > [ 34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba864b320 > [ 34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 0000000000000070 > [ 34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000401120 > [ 34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 0000000000000000 > > After this series: > Well, "Progress!" all day long. Yes, IIUC the race can happen like this in your below test: main thread uffd thread disgard thread =========== =========== ============== access page uffd page fault wait for page UFFDIO_ZEROCOPY put a page P there MADV_DONTNEED on P wakeup main thread return from fault page still missing uffd page fault again (without ALLOW_RETRY) --> SIGBUS. And I agree this should be one if the major issues that this series wants to address. > > > Can we please have a way to identify that this "feature" is available? > I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from > user space easily and use concurrent discards without crashing our applications. I'm not sure how others think about it, but to me this still fells into the bucket of "solving an existing problem" rather than a feature. Also note that this should change the behavior for the page fault logic in general, rather than an uffd-only change. So I'm also not sure whether UFFD_FEAT_* suites here even if we want it. > > > Questions: > 1. I assume KVM will do multiple retries as well, and have the same behavior, right? Yes I believe so, or we'll need to fix it. > > 2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE? > For now we were able to trigger a signal this way. If I'm not mistaken the UFFDIO_WAKE will directly trigger the sigbus even without the help of the MADV_DONTNEED race. > If the behavior is changed, can > we make this configurable via a UFFD_FEAT? I'll still think that could be an overkill, but I'll leave the discussion to the experts. Thanks, > > --- snip --- > #include <string.h> > #include <stdbool.h> > #include <stdint.h> > #include <sys/types.h> > #include <stdio.h> > #include <pthread.h> > #include <errno.h> > #include <unistd.h> > #include <stdlib.h> > #include <fcntl.h> > #include <poll.h> > #include <linux/userfaultfd.h> > #include <sys/mman.h> > #include <sys/syscall.h> > #include <sys/ioctl.h> > > static int page_size; > > static void *fault_handler_thread(void *arg) > { > const long uffd = (long) arg; > struct pollfd pollfd = { > .fd = uffd, > .events = POLLIN, > }; > int ret; > > while (true) { > struct uffdio_zeropage zeropage = {}; > struct uffd_msg msg; > ssize_t nread; > > if (poll(&pollfd, 1, -1) == -1) { > fprintf(stderr, "POLL failed: %s\n", strerror(errno)); > exit(-1); > } > if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) { > fprintf(stderr, "READ failed\n"); > exit(-1); > } > if (msg.event != UFFD_EVENT_PAGEFAULT) { > fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n"); > exit(-1); > } > > zeropage.range.start = msg.arg.pagefault.address; > zeropage.range.len = page_size; > do { > ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage); > if (ret && errno != EAGAIN) { > fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno)); > exit(-1); > } > } while (ret); > } > } > static void *discard_thread(void *arg) > { > while (true) { > if (madvise(arg, page_size, MADV_DONTNEED)) { > fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno)); > exit(-1); > } > usleep(1000); > } > } > > int main(void) > { > struct uffdio_register reg; > struct uffdio_api api = { > .api = UFFD_API, > }; > pthread_t fault, discard; > long uffd; > char *area; > > page_size = sysconf(_SC_PAGE_SIZE); > > uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); > if (uffd == -1) { > fprintf(stderr, "Could not create uffd: %s\n", strerror(errno)); > exit(-1); > } > if (ioctl(uffd, UFFDIO_API, &api) == -1) { > fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno)); > exit(-1); > } > > area = mmap(NULL, page_size, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > if (area == MAP_FAILED) { > fprintf(stderr, "Could not allocate memory"); > exit(-1); > } > > reg.range.start = (uint64_t) area; > reg.range.len = page_size, > reg.mode = UFFDIO_REGISTER_MODE_MISSING; > if (ioctl(uffd, UFFDIO_REGISTER, ®) == -1) { > fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno)); > exit(-1); > } > > /* thread to provide zeropages */ > if (pthread_create(&fault, NULL, fault_handler_thread, > (void *) uffd)) { > fprintf(stderr, "Could not create fault handling thread"); > exit(-1); > } > > /* thread to discard the page */ > if (pthread_create(&discard, NULL, discard_thread, > (void *) area)) { > fprintf(stderr, "Could not create discard thread"); > exit(-1); > } > > /* keep reading/writing the page */ > while (true) { > area[7] = area[1]; > usleep(10000); > printf("Progress!\n"); > } > return 0; > } > > > -- > Thanks, > > David / dhildenb > -- Peter Xu