On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote: > Hi Peter, Hi, Yan, > > Not sure if I missed anything. > > It looks that before this patch, pmd/pud are alawys write protected without > checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)". pud_wrprotect() > clears dirty bit by moving the dirty value to the software bit. > > And I have a question that why previously pmd/pud are always write protected. IIUC this is a separate question - the move of dirty bit in pud_wrprotect() is to avoid wrongly creating shadow stack mappings. In our discussion I think that's an extra complexity and can be put aside; the dirty bit will get recovered in pud_clear_saveddirty() later, so it's not the same as pud_mkclean(). AFAIU pmd/pud paths don't consider is_cow_mapping() because normally we will not duplicate pgtables in fork() for most of shared file mappings (!CoW). Please refer to vma_needs_copy(), and the comment before returning false at last. I think it's not strictly is_cow_mapping(), as we're checking anon_vma there, however it's mostly it, just to also cover MAP_PRIVATE on file mappings too when there's no CoW happened (as if CoW happened then anon_vma will appear already). There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps. Userfault & mixedmap are not involved in this series at all, so let's discuss pfnmaps. It means, fork() can still copy pgtable for pfnmap vmas, and it's relevant to this series, because before this series pfnmap only exists in pte level, hence IMO the is_cow_mapping() must exist for pte level as you described, because it needs to properly take care of those. Note that in the pte processing it also checks pte_write() to make sure it's a COWed page, not a RO page cache / pfnmap / ..., for example. Meanwhile, since pfnmap won't appear in pmd/pud, I think it's fair that pmd/pud assumes when seeing a huge mapping it must be MAP_PRIVATE otherwise the whole copy_page_range() could be already skipped. IOW I think they only need to process COWed pages here, and those pages require write bit removed in both parent and child when fork(). After this series, pfnmaps can appear in the form of pmd/pud, then the previous assumption will stop holding true, as we'll still copy pfnmaps during fork() always. My guessing of the reason is because most of the drivers map pfnmap vmas only during mmap(), it means there can normally have no fault() handler at all for those pfns. In this case, we'll need to also identify whether the page is COWed, using the newly added "is_cow_mapping() && pxx_write()" in this series (added to pud path, while for pmd path I used a WARN_ON_ONCE instead). If we don't do that, it means e.g. for a VM_SHARED pfnmap vma, after fork() we'll wrongly observe write protected entries. Here the change will make sure VM_SHARED can properly persist the write bits on pmds/puds. Hope that explains. Thanks, -- Peter Xu