On 21.05.22 20:50, Chih-En Lin wrote:
> On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
>> On 19.05.22 20:31, Chih-En Lin wrote:
>>> When creating a user process, the kernel usually uses the
>>> Copy-On-Write (COW) mechanism to save memory and the time cost of
>>> copying. COW defers the work of copying private memory and shares it
>>> across the processes as read-only. If either process wants to write
>>> to this memory, it will page fault and copy the shared memory, so the
>>> process then gets its own private copy; this is called breaking COW.
>>
>> Yes. Lately we've been dealing with advanced COW+GUP pinnings (which
>> resulted in PageAnonExclusive, which should hit upstream soon), and
>> hearing about COW of page tables (and wondering how it will interact
>> with the mapcount, refcount, PageAnonExclusive of anonymous pages)
>> makes me feel a bit uneasy :)
>
> I saw the patch series for this and know how complicated handling COW
> of the physical pages is [1][2][3][4]. So the COW page table tends to
> restrict the sharing to the page table only. This means any
> modification to a physical page will trigger breaking COW of the page
> table.
>
> Presently, the implementation only accounts the physical page
> information to the RSS of the owner process of the COW PTE. Generally,
> the owner is the parent process. The state of the page, like refcount
> and mapcount, does not change under the COW page table.
>
> But if any situation leads to the COW page table needing to consider
> the state of the physical page, it might be fretful. ;-)

I haven't looked into the details of how GUP deals with these COW page
tables. But I suspect there might be problems with page pinning:
skipping copy_present_page() even for R/O pages is usually problematic
with R/O pinnings of pages. I might just be wrong.

>
>>>
>>> Presently, this kind of technique is only applied to the mapped
>>> memory; the entire page table still needs to be copied from the
>>> parent. Copying each page table can cost a lot of time and memory
>>> when the parent already has many page tables allocated. For example,
>>> here is the state before and after forking a process that mapped
>>> 1 GB of memory:
>>>
>>>              mmap before fork    mmap after fork
>>> MemTotal:        32746776 kB        32746776 kB
>>> MemFree:         31468152 kB        31463244 kB
>>> AnonPages:        1073836 kB         1073628 kB
>>> Mapped:             39520 kB           39992 kB
>>> PageTables:          3356 kB            5432 kB
>>
>> I'm missing the most important point: why do we care, and why should
>> we care to make our COW/fork implementation even more complicated?
>>
>> Yes, we might save some page tables and we might reduce the fork()
>> time; however, which specific workload really benefits from this, and
>> why do we really care about that workload? Without even hearing about
>> an example user in this cover letter (unless I missed it), I naturally
>> wonder about the relevance in practice.
>>
>> I assume it really only matters if we fork() relatively large
>> processes, like databases for snapshotting. However, fork() is already
>> a pretty severe performance hit due to COW, and there are alternatives
>> being developed as a replacement for such use cases (e.g., uffd-wp).
>>
>> I'm also missing a performance evaluation: I'd expect that some simple
>> workloads that use fork() might be even slower after fork() with this
>> change.
>
> The paper mentions a list of benchmarks of the time cost of on-demand
> fork. For example, on Redis, the mean fork time when taking a snapshot:
> default fork() got 7.40 ms; on-demand fork (COW PTE table) got 0.12 ms.
> But some other cases, like the response latency distribution of the
> Apache HTTP Server, do not see significant benefits from on-demand
> fork.

Thanks. I expected that snapshotting would pop up and be one of the most
prominent users that could benefit.

However, for that specific use case I am convinced that uffd-wp is the
better choice, and fork() is just the old way of doing it, having had
nothing better at hand. QEMU already implements snapshotting of VMs that
way, and I remember that Redis also intended to implement support for
uffd-wp. I'm not sure what happened with that and whether anything is
missing to make it work.
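
Just to illustrate the kind of thing I mean, here is a rough, untested
userspace sketch (the helper names are made up, error handling is
omitted, and it assumes a kernel with UFFD_FEATURE_PAGEFAULT_FLAG_WP,
i.e. 5.7+ for anonymous memory): instead of fork()ing, the region gets
write-protected, and pages are copied lazily when a monitor thread
observes WP faults.

#include <fcntl.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Set or clear uffd write-protection on a range. */
static int uffd_wp_range(int uffd, void *addr, size_t len, bool wp)
{
        struct uffdio_writeprotect wp_args = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
        };

        return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp_args);
}

/* Start a snapshot of [addr, addr + len): register + write-protect. */
int snapshot_start(void *addr, size_t len)
{
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
        };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
            ioctl(uffd, UFFDIO_REGISTER, &reg) ||
            uffd_wp_range(uffd, addr, len, true))
                return -1;
        /*
         * A monitor thread now read()s struct uffd_msg from uffd; for a
         * pagefault message with UFFD_PAGEFAULT_FLAG_WP set, it saves
         * the old page contents into the snapshot and resolves the
         * fault by clearing write-protection via
         * uffd_wp_range(..., false).
         */
        return uffd;
}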

>
> For the COW page table from this patch, I also used perf to analyze
> the time cost. But it does not look different from the default fork.

Interesting, thanks for sharing.

>
> Here is the report; mmap-sfork is the COW page table version:
>
>  Performance counter stats for './mmap-fork' (100 runs):
>
>            373.92 msec task-clock        #    0.992 CPUs utilized    ( +- 0.09% )
>                 1      context-switches  #    2.656 /sec             ( +- 6.03% )
>                 0      cpu-migrations    #    0.000 /sec
>               881      page-faults       #    2.340 K/sec            ( +- 0.02% )
>     1,860,460,792      cycles            #    4.941 GHz              ( +- 0.08% )
>     1,451,024,912      instructions      #    0.78  insn per cycle   ( +- 0.00% )
>       310,129,843      branches          #  823.559 M/sec            ( +- 0.01% )
>         1,552,469      branch-misses     #    0.50% of all branches  ( +- 0.38% )
>
>          0.377007 +- 0.000480 seconds time elapsed  ( +- 0.13% )
>
>  Performance counter stats for './mmap-sfork' (100 runs):
>
>            373.04 msec task-clock        #    0.992 CPUs utilized    ( +- 0.10% )
>                 1      context-switches  #    2.660 /sec             ( +- 6.58% )
>                 0      cpu-migrations    #    0.000 /sec
>               877      page-faults       #    2.333 K/sec            ( +- 0.08% )
>     1,851,843,683      cycles            #    4.926 GHz              ( +- 0.08% )
>     1,451,763,414      instructions      #    0.78  insn per cycle   ( +- 0.00% )
>       310,270,268      branches          #  825.352 M/sec            ( +- 0.01% )
>         1,649,486      branch-misses     #    0.53% of all branches  ( +- 0.49% )
>
>          0.376095 +- 0.000478 seconds time elapsed  ( +- 0.13% )
>
> So, COW of the page table may reduce the time of forking, but it does
> so by transferring the copy work to the operations that later modify
> the physical pages.

Right.

>
>> I have tons of questions regarding rmap, accounting, GUP, page table
>> walkers, OOM situations in page walkers, but at this point I am not
>> (yet) convinced that the added complexity is really worth it. So I'd
>> appreciate some additional information.
>
> It seems like I have a lot of work to do. ;-)

Messing with page tables and COW is usually like opening a can of
worms :)

-- 
Thanks,

David / dhildenb