On 7/1/21 6:46 AM, Kaiyang Zhao wrote:
> The design is that instead of copying the entire paging tree during the
> fork invocation, we make the child and the parent process share the same
> set of last-level page tables, which will be reference counted. To preserve
> the copy-on-write semantics, we disable the write permission in PMD entries
> in fork, and copy PTE tables as needed in the page fault handler.

That's clever.  But I'm not sure it's comprehensive.  How, for instance,
do you handle get_user_pages() users that don't actually write to the
mappings?  Or memory reclaim, where the kernel itself goes and zaps page
table entries without ever accessing the mapping that's being zapped?

I would have expected a *lot* more pervasive changes to page table walkers
across the kernel.

Oh, and the code itself makes my eyes bleed.  You might want to spend a
bit of time cleaning out the debug printk()s and making sure this gets
somewhere close to passing checkpatch.pl if you want to be taken more
seriously.  For example:

> +	if (pte_present(ptent)) {
> +		struct page *page;
> +
> +		if (pte_special(ptent)) { //known special pte: vvar VMA, which has just one page shared system-wide. Shouldn't matter
> +			continue;
> +		}
> +		page = vm_normal_page(NULL, addr, ptent); //kyz : vma is not important
> +		if (unlikely(!page))
> +			continue;
> +		rss[mm_counter(page)]--;
> +#ifdef CONFIG_DEBUG_VM
> +		// printk("zap_one_pte_table: addr=%lx, end=%lx, (before) mapcount=%d, refcount=%d\n", addr, end, page_mapcount(page), page_ref_count(page));
> +#endif
> +		page_remove_rmap(page, false);
> +		put_page(page);
> +	}
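To make the scheme in the quoted cover letter concrete, here is a minimal user-space sketch of it.  It is a toy model, not the patch's code: every name in it (struct pte_table, struct model_mm, table_alloc(), model_fork(), model_write(), naive_zap(), model_release(), PTRS_PER_TABLE) is hypothetical, each address space has a single PMD entry, and it tracks only the sharing of the PTE table itself.  The per-page copy-on-write that the real fault handler must still preserve (write-protected PTEs, page refcounts and mapcounts) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, tiny table size so the model stays readable. */
#define PTRS_PER_TABLE 8

/* Hypothetical last-level page table that several address spaces may share. */
struct pte_table {
	int refcount;                      /* number of model mms pointing here */
	unsigned long pte[PTRS_PER_TABLE]; /* stand-in for PTE values */
};

/* Hypothetical address space with one PMD entry; pmd_writable mimics the PMD RW bit. */
struct model_mm {
	struct pte_table *table;
	bool pmd_writable;
};

static struct pte_table *table_alloc(void)
{
	struct pte_table *t = calloc(1, sizeof(*t));

	if (t)
		t->refcount = 1;
	return t;
}

/* fork(): share the parent's table instead of copying it; write-protect both PMDs. */
static void model_fork(struct model_mm *parent, struct model_mm *child)
{
	child->table = parent->table;
	child->table->refcount++;
	parent->pmd_writable = false;
	child->pmd_writable = false;
}

/*
 * Store through a PMD. If the PMD is write-protected, this models the fault
 * path: unshare the table if another mm still references it, then restore
 * write permission. Returns false if the private copy cannot be allocated.
 */
static bool model_write(struct model_mm *mm, int idx, unsigned long val)
{
	if (!mm->pmd_writable) {
		struct pte_table *old = mm->table;

		if (old->refcount > 1) {
			struct pte_table *copy = table_alloc();

			if (!copy)
				return false;
			memcpy(copy->pte, old->pte, sizeof(copy->pte));
			old->refcount--;
			mm->table = copy;
		}
		mm->pmd_writable = true;
	}
	mm->table->pte[idx] = val;
	return true;
}

/* Hypothetical walker that edits entries in place without unsharing first. */
static void naive_zap(struct model_mm *mm, int idx)
{
	mm->table->pte[idx] = 0;
}

/* exit()/munmap(): drop this mm's reference; free the table on the last one. */
static void model_release(struct model_mm *mm)
{
	assert(mm->table->refcount > 0);
	if (--mm->table->refcount == 0)
		free(mm->table);
	mm->table = NULL;
}
```

After model_fork(), both mms point at one table with refcount 2, and the first model_write() from either side gives the writer a private copy.  naive_zap() stands in for the kind of walker asked about above (reclaim, or anything else that edits entries without taking a write fault): in this model it never unshares, so a zap through one mm is also visible through the other.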