On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote: > On 29.09.22 20:29, Chih-En Lin wrote: > > On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote: > > > > > IMHO, a relaxed form that focuses on only the memory consumption reduction > > > > > could *possibly* be accepted upstream if it's not too invasive or complex. > > > > > During fork(), we'd do exactly what we used to do to PTEs (increment > > > > > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O, > > > > > duplicate swap entries; all while holding the page table lock), however, > > > > > sharing the prepared page table with the child process using COW after we > > > > > prepared it. > > > > > > > > > > Any (most once we want to *optimize* rmap handling) modification attempts > > > > > require breaking COW -- copying the page table for the faulting process. But > > > > > at that point, the PTEs are already write-protected and properly accounted > > > > > (refcount/mapcount/PageAnonExclusive). > > > > > > > > > > Doing it that way might not require any questionable GUP hacks and swapping, > > > > > MMU notifiers etc. "might just work as expected" because the accounting > > > > > remains unchanged" -- we simply de-duplicate the page table itself we'd have > > > > > after fork and any modification attempts simply replace the mapped copy. > > > > > > > > Agree. > > > > However for GUP hacks, if we want to do the COW to page table, we still > > > > need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to > > > > check whether the PTE table is available or not before we do the COW to > > > > the table). Otherwise, it will be more complicated since it might need > > > > to handle situations like while preparing the COW work, it just figuring > > > > out that it needs to duplicate the whole table and roll back (recover > > > > the state and copy it to new table). Hopefully, I'm not wrong here. > > > > > > The nice thing is that GUP itself *usually* doesn't modify page tables. One > > > corner case is follow_pfn_pte(). All other modifications should happen in > > > the actual fault handler that has to deal with such kind of unsharing either > > > way when modifying the PTE. > > > > > > If the pages are already in a COW-ed pagetable in the desired "shared" state > > > (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such > > > pages will just work as expected and we shouldn't be surprised by another > > > set of GUP+COW CVEs. > > > > > > We'd really only deduplicate the page table and not play other tricks with > > > the actual page table content that differ from the existing way of handling > > > fork(). > > > > > > I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when > > > not modifying the page table. I think we only need "we have to unshare this > > > page table now" in follow_pfn_pte() and inside the fault handling when GUP > > > triggers a fault. > > > > > > I hope my assumption is correct, or am I missing something? > > > > > > > My consideration is when we pinned the page and did the COW to make the > > page table be shared. It might not allow mapping the pinned page to R/O) > > into both processes. > > > > So, if the fork is working on the shared state, it needs to recover the > > table and copy to a new one since that pinned page will need to copy > > immediately. We can hold the shared state after occurring such a > > situation. So we still need some trick to let the fork() know which page > > table already has the pinned page (or such page won't let us share) > > before going to duplicate. > > > > Am I wrong here? > > I think you might be overthinking this. Let's keep it simple: > > 1) Handle pinned anon pages just as I described below, falling back to the > "slow" path of page table copying. > > 2) Once we passed that stage, you can be sure that the COW-ed page table > cannot have actually pinned anon pages. All anon pages in such a page table > have PageAnonExclusive cleared and are "maybe shared". GUP cannot succeed in > pinning these pages anymore, because it will only pin exclusive anon pages! > > 3) If anybody wants to take a R/O pin on a shared anon page that is mapped > into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead > of pinning the page. This has to break COW on the page table and properly > map an exclusive anon page into it, breaking COW. > > Do you see a problem with that? > > > > > After that, since we handled the accounting in fork(), we don't need > > ownership (pmd_t pointer) anymore. We have to find another way to mark > > the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is > > stored at that space.) > > > > > > > > > > > But devil is in the detail (page table lock, TLB flushing). > > > > > > > > Sure, it might be an overhead in the page fault and needs to be handled > > > > carefully. ;) > > > > > > > > > "will make fork() even have more overhead" is not a good excuse for such > > > > > complexity/hacks -- sure, it will make your benchmark results look better in > > > > > comparison ;) > > > > > > > > ;);) > > > > I think that, even if we do the accounting with the COW page table, it > > > > still has a little bit improve. > > > > > > :) > > > > > > My gut feeling is that this is true. While we have to do a pass over the > > > parent page table during fork and wrprotect all PTEs etc., we don't have to > > > duplicate the page table content and allocate/free memory for that. > > > > > > One interesting case is when we cannot share an anon page with the child > > > process because it maybe pinned -- and we have to copy it via > > > copy_present_page(). In that case, the page table between the parent and the > > > child would differ and we'd not be able to share the page table. > > > > That is what I want to say above. > > The case might happen in the middle of the shared page table progress. > > It might cost more overhead to recover it. Therefore, if GUP wants to > > pin the mapped page we can mark the PTE table first, so fork() won't > > waste time doing the work for sharing. > > Having pinned pages is a corner case for most apps. No need to worry about > optimizing this corner case for now. > > I see what you are trying to optimize, but I don't think this is needed in a > first version, and probably never is needed. > > > Any attempts to mark page tables in a certain way from GUP > (COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast > (get_user_pages_fast) can race with pretty much anything, even with > concurrent fork. I suspect your current code might be really racy in that > regard. I see. Now, I know why optimizing that corner case is not worth it. Thank you for explaining that. Thanks, Chih-En Lin