On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin <shiyn.lin@xxxxxxxxx> wrote:
>
> v3 -> v4
> - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
>   s390 and powerpc32, don't support the PMD entry and PTE table
>   operations.
> - Fix unmatch type of break_cow_pte_range() in
>   migrate_vma_collect_pmd().
> - Don’t break COW PTE in folio_referenced_one().
> - Fix the wrong VMA range checking in break_cow_pte_range().
> - Only break COW when we modify the soft-dirty bit in
>   clear_refs_pte_range().
> - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
> - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
>   tlb_flush_pmd_range().
> - Handle VM_DONTCOPY with COW PTE fork.
> - Fix the wrong address and invalid vma in recover_pte_range().
> - Fix the infinite page fault loop in GUP routine.
>   In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
>   handler, we return -EMLINK to let the GUP handles the page fault
>   (call faultin_page() in __get_user_pages()).
> - return not_found(pvmw) if the break COW PTE failed in
>   page_vma_mapped_walk().
> - Since COW PTE has the same result as the normal COW selftest, it
>   probably passed the COW selftest.
>
>   # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
>   not ok 33 No leak from parent into child
>   # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
>   not ok 44 No leak from parent into child
>   # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
>   not ok 55 No leak from child into parent
>   # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
>   not ok 66 No leak from child into parent
>
>   Bail out! 4 out of 147 tests failed
>   # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
>   See the more information about anon cow hugetlb tests:
>     https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@xxxxxxxxxx/
>
>
> v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@xxxxxxxxx/T/
>
> RFC v2 -> v3
> - Change the sysctl with PID to prctl(PR_SET_COW_PTE).
> - Account all the COW PTE mapped pages in fork() instead of defer it to
>   page fault (break COW PTE).
> - If there is an unshareable mapped page (maybe pinned or private
>   device), recover all the entries that are already handled by COW PTE
>   fork, then copy to the new one.
> - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP,
>   follow_pfn_pte().
> - Remove the PTE ownership since we don't need it.
> - Use pte lock to protect the break COW PTE and free COW-ed PTE.
> - Do TLB flushing in break COW PTE handler.
> - Handle THP, KSM, madvise, mprotect, uffd and migrate device.
> - Handle the replacement page of uprobe.
> - Handle the clear_refs_write() of fs/proc.
> - All of the benchmarks dropped since the accounting and pte lock.
>   The benchmarks of v3 is worse than RFC v2, most of the cases are
>   similar to the normal fork, but there still have an use case
>   (TriforceAFL) is better than the normal fork version.
>
> RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@xxxxxxxxx/T/
>
> RFC v1 -> RFC v2
> - Change the clone flag method to sysctl with PID.
> - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
>   MMF_COW_PTE_READY, for the sysctl.
> - Change the owner pointer to use the folio padding.
> - Handle all the VMAs that cover the PTE table when doing the break COW PTE.
> - Remove the self-defined refcount to use the _refcount for the page
>   table page.
> - Add the exclusive flag to let the page table only own by one task in
>   some situations.
> - Invalidate address range MMU notifier and start the write_seqcount
>   when doing the break COW PTE.
> - Handle the swap cache and swapoff.
>
> RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@xxxxxxxxx/
>
> ---
>
> Currently, copy-on-write is only used for the mapped memory; the child
> process still needs to copy the entire page table from the parent
> process during forking. The parent process might take a lot of time and
> memory to copy the page table when the parent has a big page table
> allocated. For example, the memory usage of a process after forking with
> 1 GB mapped memory is as follows:

For some reason, I was not able to reproduce the performance
improvements with a simple fork() latency measurement program. The
results that I saw are the following:

Base:
Fork latency per gigabyte: 0.004416 seconds
Fork latency per gigabyte: 0.004382 seconds
Fork latency per gigabyte: 0.004442 seconds

COW kernel:
Fork latency per gigabyte: 0.004524 seconds
Fork latency per gigabyte: 0.004764 seconds
Fork latency per gigabyte: 0.004547 seconds

AMD EPYC 7B12 64-Core Processor
Base:
Fork latency per gigabyte: 0.003923 seconds
Fork latency per gigabyte: 0.003909 seconds
Fork latency per gigabyte: 0.003955 seconds

COW kernel:
Fork latency per gigabyte: 0.004221 seconds
Fork latency per gigabyte: 0.003882 seconds
Fork latency per gigabyte: 0.003854 seconds

Given that the page table for the child is not copied, I was expecting
fork() to be faster with the COW kernel, and its latency not to depend
on the size of the parent.
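
For scale, here is a rough sketch of how much last-level page table a
region mapped with 4 KiB pages needs, which is the part a normal fork()
has to copy. The constants assume x86-64 (4 KiB pages, 8-byte PTEs, 512
entries per table); they are not stated anywhere in this thread, and the
helper names are made up for illustration.

```c
#include <stdio.h>

/* Assumed x86-64 constants: 4 KiB base pages, 8-byte PTEs, no huge pages. */
#define PAGE_SIZE_BYTES	4096ULL
#define PTE_SIZE_BYTES	8ULL
#define PTES_PER_TABLE	(PAGE_SIZE_BYTES / PTE_SIZE_BYTES)	/* 512 */

/*
 * Number of last-level (PTE) tables needed to map `bytes` of memory,
 * assuming the region starts on a boundary of one table's span (2 MiB).
 */
unsigned long long pte_tables_for(unsigned long long bytes)
{
	unsigned long long span = PAGE_SIZE_BYTES * PTES_PER_TABLE;

	return (bytes + span - 1) / span;
}

/* Memory taken by those PTE tables; each table occupies one page. */
unsigned long long pte_table_bytes_for(unsigned long long bytes)
{
	return pte_tables_for(bytes) * PAGE_SIZE_BYTES;
}
```

Under these assumptions, 1 GiB of mappings needs 512 PTE tables (2 MiB),
and the 32 GiB mapped by the test program needs 16384 tables (64 MiB)
for every full copy of the page table.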

Test program:

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>

#define USEC	1000000
#define GIG	(1ul << 30)
#define NGIG	32
#define SIZE	(NGIG * GIG)
#define NPROC	16

int main(void)
{
	int page_size = getpagesize();
	struct timeval start, end;
	long duration, i;
	char *p;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	/* Keep 4K mappings so that the PTE tables are fully populated */
	madvise(p, SIZE, MADV_NOHUGEPAGE);

	/* Touch every page */
	for (i = 0; i < SIZE; i += page_size)
		p[i] = 0;

	gettimeofday(&start, NULL);
	for (i = 0; i < NPROC; i++) {
		int pid = fork();

		if (pid < 0) {
			perror("fork");
			exit(1);
		}
		if (pid == 0) {
			/* Child: stay alive so teardown is not measured */
			sleep(30);
			exit(0);
		}
	}
	gettimeofday(&end, NULL);

	/* Normalize per proc and per gig */
	duration = ((end.tv_sec - start.tv_sec) * USEC +
		    (end.tv_usec - start.tv_usec)) / NPROC / NGIG;
	printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
	       duration / USEC, duration % USEC);
	return 0;
}
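
A possible variation on the measurement, as a minimal sketch: it times
each fork() on its own with CLOCK_MONOTONIC instead of averaging a batch
with gettimeofday(), so a wall-clock adjustment cannot skew the result.
The helper names elapsed_ns() and time_one_fork_ns() are hypothetical.
Unlike the test program, the child here exits right away and is reaped
outside the timed region, so its teardown may overlap with later
measurements.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
long long elapsed_ns(const struct timespec *start, const struct timespec *end)
{
	return (end->tv_sec - start->tv_sec) * 1000000000LL +
	       (end->tv_nsec - start->tv_nsec);
}

/*
 * Time a single fork() as seen by the parent. The child exits
 * immediately and is reaped after the second clock sample.
 * Returns the latency in nanoseconds, or -1 if fork() failed.
 */
long long time_one_fork_ns(void)
{
	struct timespec start, end;
	pid_t pid;

	clock_gettime(CLOCK_MONOTONIC, &start);
	pid = fork();
	if (pid == 0)
		_exit(0);
	clock_gettime(CLOCK_MONOTONIC, &end);

	if (pid < 0)
		return -1;
	waitpid(pid, NULL, 0);
	return elapsed_ns(&start, &end);
}
```

Calling time_one_fork_ns() in a loop after the mmap() and page-touching
steps, and dividing each result by the number of gigabytes mapped, would
give per-fork numbers comparable to the ones reported above.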