Currently, copy-on-write is only used for the mapped memory; the child process still needs to copy the entire page table from the parent process during forking. The parent process might take a lot of time and memory to copy the page table when the parent has a big page table allocated. For example, the memory usage of a process after forking with 1 GB mapped memory is as follows: DEFAULT FORK parent child VmRSS: 1049688 kB 1048688 kB VmPTE: 2096 kB 2096 kB This patch introduces copy-on-write (COW) for the PTE level page tables. COW PTE improves performance in the situation where the user needs copies of the program to run on isolated environments. Feedback-based fuzzers (e.g., AFL) and serverless/microservice frameworks are two major examples. For instance, COW PTE achieves a 9.3x throughput increase when running SQLite on a fuzzer (AFL). As COW PTE only boosts performance in some cases, the patch adds a new sysctl, vm.cow_pte, with the input process ID (PID) to allow the user to enable COW PTE for a specific process. To handle the page table state of each process that has a shared PTE table, the patch introduces the concept of COW PTE table ownership. This implementation uses the address of the PMD index to track the ownership of the PTE table. This helps maintain the state of the COW PTE tables, such as the RSS and pgtable_bytes. Some PTE tables (e.g., pinned pages that reside in the table) still need to be copied immediately for consistency with the current COW logic. As a result, a flag, COW_PTE_OWNER_EXCLUSIVE, indicating whether a PTE table is exclusive (i.e., only one task owns it at a time) is added to the table’s owner pointer. Every time a PTE table is copied during the fork, the owner pointer (and thus the exclusive flag) will be checked to determine whether the PTE table can be shared across processes. This patch uses a refcount to track the shared page table's lifetime. Invoking fork with COW PTE will increase the refcount. A refcount=1 means that the page table is not currently shared with another process but may be shared. And, when someone writes to the shared PTE table, it will cause the write fault to break COW PTE. If the shared PTE table's refcount is one, the process that triggers the fault will reuse the shared PTE table. Otherwise, the process will decrease the refcount, copy the information to a new PTE table or dereference all the information and change the owner if they have the shared PTE table. After applying COW to PTE, the memory usage after forking is as follows: COW PTE parent child VmRSS: 1049968 kB 2576 kB VmPTE: 2096 kB 44 kB The results show that this patch significantly decreases memory usage. Other improvements such as lower fork latency and page fault latency, which are the major benefits, are discussed later. Real-world applications ======================= We run benchmarks of fuzzing and VM cloning. The experiments were done with the normal fork or the fork with COW PTE. With AFL (LLVM mode) and SQLite, COW PTE (503.67 execs/sec) achieves a 9.3x throughput increase over the normal fork version (53.86 execs/sec). fork execs_per_sec unix_time time count 26.000000 2.600000e+01 26.000000 mean 53.861538 1.663145e+09 84.423077 std 3.715063 5.911357e+01 59.113567 min 35.980000 1.663145e+09 0.000000 25% 54.440000 1.663145e+09 32.250000 50% 54.610000 1.663145e+09 82.000000 75% 54.837500 1.663145e+09 140.750000 max 55.600000 1.663145e+09 178.000000 COW PTE execs_per_sec unix_time time count 36.000000 3.600000e+01 36.000000 mean 503.674444 1.663146e+09 88.916667 std 81.805271 5.369191e+01 53.691912 min 84.910000 1.663146e+09 0.000000 25% 472.952500 1.663146e+09 44.500000 50% 504.700000 1.663146e+09 89.000000 75% 553.367500 1.663146e+09 133.250000 max 568.270000 1.663146e+09 178.000000 With TriforceAFL which is for kernel fuzzing with QEMU, COW PTE (124.31 execs/sec) achieves a 1.3x throughput increase over the normal fork version (96.44 execs/sec). fork execs_per_sec unix_time time count 18.000000 1.800000e+01 18.000000 mean 96.436667 1.663146e+09 84.388889 std 25.260184 6.601795e+01 66.017947 min 6.590000 1.663146e+09 0.000000 25% 91.025000 1.663146e+09 21.250000 50% 100.350000 1.663146e+09 92.000000 75% 111.247500 1.663146e+09 146.750000 max 122.260000 1.663146e+09 169.000000 COW PTE execs_per_sec unix_time time count 22.000000 2.200000e+01 22.000000 mean 124.305455 1.663147e+09 90.409091 std 32.508728 6.033846e+01 60.338457 min 6.590000 1.663146e+09 0.000000 25% 113.227500 1.663146e+09 26.250000 50% 122.435000 1.663147e+09 112.000000 75% 145.792500 1.663147e+09 141.500000 max 161.280000 1.663147e+09 168.000000 Comparison with uffd ==================== For RFC v1, David Hildenbrand mentioned that uffd-wp is a new way of snapshotting in QEMU. There is some overlap between uffd and fork use cases, such as database snapshotting. So the following microbenchmarks also measure the overhead of uffd-wp and uffd-copy-page. To be fair in terms of CPU usage, the uffd handlers are pinned to the same core as the main thread. uffd-wp simulates the work QEMU does with uffd-wp. It will store the page that causes the fault into a memory buffer and remove write protection for that page. Also, uffd-copy-page will allocate the memory and replace the original page that causes the fault. Microbenchmark - syscall/registering latency ============================================= We run microbenchmarks to measure the latency of a fork syscall or registering uffd with sizes of mapped memory ranging from 0 to 512 MB for the use cases that focus on lowering startup time (e.g., serverless frameworks). The results show that the latency of a normal fork and registering uffd-wp reaches 10 ms and 3.9 ms respectively, while the latency of registering uffd-copy-page is around 0.007 ms. The latency of a fork with COW PTE is around 0.625 ms after 200 MB, which is significantly lower than the normal fork/uffd-wp. In short, with 512 MB mapped memory, COW PTE decreases latency by 93% for normal fork and 83% for uffd-wp. Microbenchmark - page fault latency ==================================== We conducted some microbenchmarks to measure page fault latency with different patterns of accesses to a 512 MB memory buffer after forking or registering uffd. In the first experiment, the program accesses the entire 512 MB memory by writing to all the pages consecutively. The experiment is done with normal fork, fork with COW PTE, uffd-wp, and uffd-copy-page and calculates the single access average latency. The result shows that the page fault latency of COW PTE (0.000045 ms) is 59.5x faster than the uffd-wp (0.002676 ms). The low uffd-wp performance is probably because of the cost of switching between kernel and user mode. What is more interesting is that COW PTE also improves the average page fault latency. COW PTE page fault latency (0.000045 ms) is 16.5x lower than the normal fork fault latency (0.000742 ms). Here are the raw numbers: Page fault - Access to the entire 512 MB memory fork mean: 0.000742 ms COW PTE mean: 0.000045 ms uffd (wp) mean: 0.002676 ms uffd (copy-page) mean: 0.008667 ms The second experiment simulates real-world applications with sparse accesses. The program randomly accesses the memory by writing to one random page 1 million times and calculates the average access time. Since the number of fork and COW PTE are too close to each other, we cannot simply conclude which one is faster, so we run both 100 times to get the averages. The result shows that COW PTE (0.000027 ms) is similar to normal fork (0.000028 ms) and is 2.3x faster than uffd-wp (0.000060 ms). Page fault - Random access fork mean: 0.000028 ms COW PTE mean: 0.000027 ms uffd (wp) mean: 0.000060 ms uffd (copy-page) mean: 0.002363 ms All the tests were run with QEMU and the kernel was built with the x86_64 default config. Summary ======= In summary, COW PTE reduces the memory footprint of processes and improves the initialization and page fault latency for various applications, which would be important to some frameworks that require very low execution startup (e.g., serverless framework) or high-throughput short executions of child processes (e.g., testing). This patch is based on the paper "On-demand-fork: a microsecond fork for memory-intensive and latency-sensitive applications" [1] from Purdue University. Any comments and suggestions are welcome. Thanks, Chih-En Lin --- TODO list: - Handle the file-backed and shmem with reclaim. - Handle OOM, KSM, page table walker, and migration. - Deal with TLB flush in the break COW PTE handler. RFC v1 -> RFC v2 - Change the clone flag method to sysctl with PID. - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and MMF_COW_PTE_READY, for the sysctl. - Change the owner pointer to use the folio padding. - Handle all the VMAs that cover the PTE table when doing the break COW PTE. - Remove the self-defined refcount to use the _refcount for the page table page. - Add the exclusive flag to let the page table only own by one task in some situations. - Invalidate address range MMU notifier and start the write_seqcount when doing the break COW PTE. - Handle the swap cache and swapoff. RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@xxxxxxxxx/ [1] https://dl.acm.org/doi/10.1145/3447786.3456258 This patch is based on v6.0-rc5. --- Chih-En Lin (9): mm: Add new mm flags for Copy-On-Write PTE table mm: pgtable: Add sysctl to enable COW PTE mm, pgtable: Add ownership to PTE table mm: Add COW PTE fallback functions mm, pgtable: Add a refcount to PTE table mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag mm: Add the break COW PTE handler mm: Handle COW PTE with reclaim algorithm mm: Introduce Copy-On-Write PTE table include/linux/mm.h | 2 + include/linux/mm_types.h | 5 +- include/linux/pgtable.h | 140 +++++++++++++ include/linux/rmap.h | 2 + include/linux/sched/coredump.h | 8 +- kernel/fork.c | 5 + kernel/sysctl.c | 8 + mm/Makefile | 2 +- mm/cow_pte.c | 39 ++++ mm/gup.c | 13 +- mm/memory.c | 360 ++++++++++++++++++++++++++++++++- mm/mmap.c | 3 + mm/mremap.c | 3 + mm/page_vma_mapped.c | 5 + mm/rmap.c | 2 +- mm/swapfile.c | 1 + mm/vmscan.c | 1 + 17 files changed, 587 insertions(+), 12 deletions(-) create mode 100644 mm/cow_pte.c -- 2.37.3