> > > Currently, copy-on-write is only used for the mapped memory; the
> > > child process still needs to copy the entire page table from the
> > > parent process during forking. The parent process might take a lot
> > > of time and memory to copy the page table when the parent has a big
> > > page table allocated. For example, the memory usage of a process
> > > after forking with 1 GB mapped memory is as follows:
> >
> > For some reason, I was not able to reproduce performance improvements
> > with a simple fork() performance measurement program. The results that
> > I saw are the following:
> >
> > Base:
> > Fork latency per gigabyte: 0.004416 seconds
> > Fork latency per gigabyte: 0.004382 seconds
> > Fork latency per gigabyte: 0.004442 seconds
> > COW kernel:
> > Fork latency per gigabyte: 0.004524 seconds
> > Fork latency per gigabyte: 0.004764 seconds
> > Fork latency per gigabyte: 0.004547 seconds
> >
> > AMD EPYC 7B12 64-Core Processor
> > Base:
> > Fork latency per gigabyte: 0.003923 seconds
> > Fork latency per gigabyte: 0.003909 seconds
> > Fork latency per gigabyte: 0.003955 seconds
> > COW kernel:
> > Fork latency per gigabyte: 0.004221 seconds
> > Fork latency per gigabyte: 0.003882 seconds
> > Fork latency per gigabyte: 0.003854 seconds
> >
> > Given that the page table for the child is not copied, I was expecting
> > the performance to be better with the COW kernel, and also not to
> > depend on the size of the parent.
>
> Yes, the child won't duplicate the page table, but fork will still
> traverse all the page table entries to do the accounting.
> And, since this patch extends COW to the PTE table level, it is no
> longer grained at the mapped page (page table entry) level, so we
> have to guarantee that all the mapped pages in such a page table are
> available for COW mapping.
> This kind of checking also costs some time.
> As a result, because of the accounting and the checking, the COW PTE
> fork still depends on the size of the parent, so the improvement
> might not be significant.

The current version of the series does not provide any performance
improvements for fork(). I would recommend removing the claims about
better fork() performance from the cover letter, as they may be
misleading for those looking for a way to speed up forking. In my case,
I was looking to speed up Redis OSS, which relies on fork() to create
consistent snapshots for driving replicas/backups. The O(N) per-page
operation causes fork() to be slow, so I was hoping that this series,
which does not duplicate the VA during fork(), would make the operation
much quicker.

> Actually, at RFC v1 and v2, we proposed a version that skips that
> work, and we got a significant improvement. You can see the numbers
> in the RFC v2 cover letter [1]:
> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> for normal fork"

I suspect the 93% improvement (when the mapcount was not updated) was
only for VAs backed by 4K pages. Is it correct that with 2M mappings
this series did not provide any benefit?

> However, it might break the existing logic of the refcount/mapcount of
> the page and destabilize the system.

This makes sense.

> [1] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@xxxxxxxxx/T/#me2340d963c2758a2561c39cb3baf42c478dfe548
> [2] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@xxxxxxxxx/T/#mbc33221f00c7cf3d71839b45fc23862a5dac3014
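
For reference, the measurement I ran was roughly along the lines of the
sketch below. It is a minimal illustration rather than the exact
program; the 1 GB size and the reporting details are arbitrary. It maps
and touches anonymous memory in the parent so the page tables are fully
populated, then times fork() and divides by the mapped size:

/*
 * Minimal sketch of a fork() latency-per-gigabyte measurement.
 * SIZE_GB and the output format are illustrative choices.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define SIZE_GB 1UL

int main(void)
{
	size_t len = SIZE_GB << 30;
	struct timespec t0, t1;
	pid_t pid;

	/* Map and touch anonymous memory so the parent's page tables
	 * are fully populated before forking. */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, len);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pid = fork();
	if (pid == 0)
		_exit(0);		/* child: exit immediately */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	waitpid(pid, NULL, 0);

	double sec = (t1.tv_sec - t0.tv_sec) +
		     (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("Fork latency per gigabyte: %f seconds\n", sec / SIZE_GB);
	return 0;
}

The per-gigabyte numbers quoted above came from repeated runs of this
kind of program on the base and COW kernels.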
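
And for context on the Redis use case: the fork()-based snapshot
pattern looks roughly like the sketch below. This is a simplified
illustration, not Redis code; save_snapshot() is a hypothetical
placeholder for serializing the in-memory dataset. The point is that
fork() itself is the O(N) step this series was expected to shorten.

/*
 * Simplified illustration of fork()-based snapshotting (not Redis
 * code). The child inherits a copy-on-write view of the parent's
 * memory as of fork() and can serialize it while the parent keeps
 * serving requests.
 */
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical placeholder: walk the dataset inherited from the
 * parent and write it out to disk. */
static void save_snapshot(void)
{
}

int main(void)
{
	pid_t pid = fork();	/* today: O(N) in the parent's mapped pages */

	if (pid < 0)
		return 1;
	if (pid == 0) {
		/* Child: consistent point-in-time view of the dataset. */
		save_snapshot();
		_exit(0);
	}

	/* Parent: keeps running; its writes after fork() land on private
	 * copies and do not disturb the child's snapshot. */
	waitpid(pid, NULL, 0);
	return 0;
}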