> On Oct 13, 2021, at 10:10 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Oct 13, 2021 at 03:42:08PM -0700, Nadav Amit wrote:
>> Andrea, Peter, others,
>
> Hi, Nadav,
>
>>
>> I encountered many unnecessary COW operations on my development kernel
>> (based on Linux 5.13), which I did not see a report about and I am not
>> sure how to solve. Any advice would be appreciated.
>>
>> Commit 09854ba94c6aa ("mm: do_wp_page() simplification") prevents the reuse of
>> a page on write-protect fault if page_count(page) != 1. In that case,
>> wp_page_reuse() is not used and instead the page is COW'd by wp_page_copy().
>> wp_page_copy() is obviously much more expensive, not only because of the
>> copying, but also because it requires a TLB flush and potentially a TLB
>> shootdown.
>>
>> The scenario I encountered happens when I use userfaultfd, but presumably it
>> might happen regardless of userfaultfd (perhaps swap device with
>> SWP_SYNCHRONOUS_IO). It involves two page faults: one that maps a new
>> anonymous page as read-only and a second write-protect fault that happens
>> shortly after on the same page. In this case the page count is almost always
>> elevated and therefore a COW is needed.
>> [ snip ]
>>
>> It turns out that the elevated page count is due to the caching of the page in
>> the local LRU cache (by lru_cache_add() which is called by
>> lru_cache_add_inactive_or_unevictable() in the case of userfaultfd). Since the
>> first fault happened shortly before the second write-protect fault, the LRU
>> cache was still not drained, so the page count was not decreased and a COW is
>> needed.
>>
>> Calling lru_add_drain() during this flow resolves the issue most of the time.
>> Obviously, it needs to be called on the core that allocated (i.e., faulted
>> in) the page initially to work. It is possible to do it conditionally only if
>> the page-count is greater than 1.
>
> I agree with your analysis.
> I didn't even notice that lru_cache_add() can make it very likely to
> trigger the COW in your uffd use case (and also for swap), but that's
> indeed something that could happen with the current page reuse logic in
> do_wp_page(), afaiu.

Just an update for the record, based on an offline correspondence with
Andrea and Peter, who were very helpful (thanks!).

I could not come up with a non-hacky solution just for this problem. While
it is possible to drain the LRU conditionally, it is admittedly a hack with
some downsides.

The aforementioned issue, an unnecessary TLB flush (or even shootdown) on
COW operations, is not limited to userfaultfd and not even to
SWP_SYNCHRONOUS_IO. It seems that whenever swap is set up on a very
low-latency device (e.g., pmem, zram), the unnecessary COW might happen and
hurt performance.

I created a small test to verify the impact of the phenomenon (the test code
is below). The swap is set up on an emulated pmem device and the test is run
with:

  ./forceswap 2 100000 1

The benchmark runs 100k rounds in which a page is accessed first for read,
then for write, and is then paged out using MADV_PAGEOUT. Each of the two
accesses causes a page fault. The test only measures the time of the second
access, which should include the wp page fault. I also measured the delta in
"nr_tlb_remote_flush" from /proc/vmstat.

The results are:

                          cycles/op    nr_tlb_remote_flush
-------------------------------------------------------------------
v5.8      bcf876870b95         1606                 300000
mainline  cb690f5238d7        10534                 399935

As shown, the write-protect fault in mainline takes ~6.5x as long, which is
explained by the COW operation and is visible as the extra TLB shootdowns
(nr_tlb_remote_flush). On bare metal this overhead should be lower, yet with
a higher number of threads the overhead would increase.

I also tried to collect the number of I/Os, but for some reason they do not
show up in /sys/dev/block/X/stat for pmem.

[ Some config details: KVM VM running on Haswell.
  host: max-freq; kvm_intel's ple_gap=0; 2MB pages.
  VM: mitigations=off idle=poll.
  Kernel compiled with CONFIG_DEBUG_TLBFLUSH=y, CONFIG_BLK_DEV_PMEM=y ]

-- >8 --

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE	(4096)
#define MAX_THREADS	(50)

volatile int stop = 0;

unsigned long nops;

/* Idle spinner: keeps extra CPUs in the mm so that flushes go remote. */
void *thread_start(void *arg)
{
	while (!stop) {
		asm volatile ("pause" ::: "memory");
	}
	return (void *)NULL;
}

static inline uint64_t rdtscp(void)
{
	uint64_t rax, rdx, aux;

	asm volatile ("rdtscp\n" : "=a" (rax), "=d" (rdx), "=c" (aux));
	return (rdx << 32) + rax;
}

int main(int argc, char *argv[])
{
	int r, nthreads, npages, j;
	unsigned long i;
	pthread_attr_t attr;
	pthread_t thread_ids[MAX_THREADS];
	void *res;
	volatile char *p, c;
	uint64_t time = 0;

	if (argc < 4) {
		fprintf(stderr, "usage: %s [nthreads] [nops] [npages]\n", argv[0]);
		exit(-1);
	}

	r = pthread_attr_init(&attr);
	if (r != 0) {
		fprintf(stderr, "error setting attributes %d\n", r);
		exit(-1);
	}

	nthreads = atoi(argv[1]);
	nops = strtoull(argv[2], NULL, 0);
	npages = atoi(argv[3]);

	/* Avoid overrunning thread_ids[] and dividing by zero below. */
	if (nthreads < 1 || nthreads > MAX_THREADS + 1 || nops == 0 || npages < 1) {
		fprintf(stderr, "invalid arguments\n");
		exit(-1);
	}

	/* The main thread counts as one; spawn the remaining spinners. */
	for (i = 0; i < nthreads - 1; i++) {
		r = pthread_create(&thread_ids[i], &attr, &thread_start, NULL);
		if (r != 0) {
			fprintf(stderr, "error creating thread %d\n", r);
			exit(-1);
		}
	}

	p = (volatile char *)mmap(0, PAGE_SIZE * npages, PROT_READ|PROT_WRITE,
				  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(-1);
	}

	for (i = 0; i < nops; i++) {
		/* Push the pages out to swap. */
		if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
			perror("madvise");
			exit(-1);
		}

		for (j = 0; j < npages; j++) {
			/* First access: read fault, maps the page read-only. */
			c = p[j * PAGE_SIZE];
			c++;
			/* Second access: write-protect fault; only this is timed. */
			time -= rdtscp();
			p[j * PAGE_SIZE] = c;
			time += rdtscp();
		}
	}

	stop = 1;

	for (i = 0; i < nthreads - 1; i++) {
		r = pthread_join(thread_ids[i], &res);
		if (r != 0) {
			fprintf(stderr, "error join\n");
			exit(-1);
		}
	}

	printf("time: %llu\n", (unsigned long long)(time / nops));

	return 0;
}
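
To collect the nr_tlb_remote_flush delta around a run, the counter can be
sampled from /proc/vmstat before and after the measurement loop. Below is a
minimal sketch of that step. It is not necessarily how the numbers above were
collected, and vmstat_lookup() and vmstat_read() are hypothetical helper names.
The counter is only exported when the kernel is built with
CONFIG_DEBUG_TLBFLUSH=y.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in a /proc/vmstat-style stream, which holds one
 * "name value" pair per line. Returns 0 and stores the value in *out on
 * success, or -1 if the key is not found.
 * (Hypothetical helper, not part of the test program above.)
 */
int vmstat_lookup(FILE *f, const char *key, unsigned long long *out)
{
	char name[64];
	unsigned long long val;

	assert(f && key && out);

	rewind(f);
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (strcmp(name, key) == 0) {
			*out = val;
			return 0;
		}
	}
	return -1;
}

/* Convenience wrapper that reads the counter from the live /proc/vmstat. */
int vmstat_read(const char *key, unsigned long long *out)
{
	FILE *f = fopen("/proc/vmstat", "r");
	int r;

	if (!f)
		return -1;
	r = vmstat_lookup(f, key, out);
	fclose(f);
	return r;
}
```

Calling vmstat_read("nr_tlb_remote_flush", &before) ahead of the measurement
loop and vmstat_read("nr_tlb_remote_flush", &after) once it finishes, then
subtracting, gives the delta. The counter is system-wide, so the machine
should be otherwise idle.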