On Tue, Sep 10, 2024 at 08:16:10AM -0400, Peter Xu wrote: > On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote: > > Hi Peter, > > Hi, Yan, > > > > > Not sure if I missed anything. > > > > It looks that before this patch, pmd/pud are alawys write protected without > > checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)". pud_wrprotect() > > clears dirty bit by moving the dirty value to the software bit. > > > > And I have a question that why previously pmd/pud are always write protected. > > IIUC this is a separate question - the move of dirty bit in pud_wrprotect() > is to avoid wrongly creating shadow stack mappings. In our discussion I > think that's an extra complexity and can be put aside; the dirty bit will > get recovered in pud_clear_saveddirty() later, so it's not the same as > pud_mkclean(). But pud_clear_saveddirty() will only set dirty bit when write bit is 1. > > AFAIU pmd/pud paths don't consider is_cow_mapping() because normally we > will not duplicate pgtables in fork() for most of shared file mappings > (!CoW). Please refer to vma_needs_copy(), and the comment before returning > false at last. I think it's not strictly is_cow_mapping(), as we're > checking anon_vma there, however it's mostly it, just to also cover > MAP_PRIVATE on file mappings too when there's no CoW happened (as if CoW > happened then anon_vma will appear already). > > There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps. > Userfault & mixedmap are not involved in this series at all, so let's > discuss pfnmaps. > > It means, fork() can still copy pgtable for pfnmap vmas, and it's relevant > to this series, because before this series pfnmap only exists in pte level, > hence IMO the is_cow_mapping() must exist for pte level as you described, > because it needs to properly take care of those. Note that in the pte > processing it also checks pte_write() to make sure it's a COWed page, not a > RO page cache / pfnmap / ..., for example. > > Meanwhile, since pfnmap won't appear in pmd/pud, I think it's fair that > pmd/pud assumes when seeing a huge mapping it must be MAP_PRIVATE otherwise > the whole copy_page_range() could be already skipped. IOW I think they > only need to process COWed pages here, and those pages require write bit > removed in both parent and child when fork(). Is it also based on that there's no MAP_SHARED huge DEVMAP pages up to now? > > After this series, pfnmaps can appear in the form of pmd/pud, then the > previous assumption will stop holding true, as we'll still copy pfnmaps > during fork() always. My guessing of the reason is because most of the > drivers map pfnmap vmas only during mmap(), it means there can normally > have no fault() handler at all for those pfns. > > In this case, we'll need to also identify whether the page is COWed, using > the newly added "is_cow_mapping() && pxx_write()" in this series (added > to pud path, while for pmd path I used a WARN_ON_ONCE instead). > > If we don't do that, it means e.g. for a VM_SHARED pfnmap vma, after fork() > we'll wrongly observe write protected entries. Here the change will make > sure VM_SHARED can properly persist the write bits on pmds/puds. > > Hope that explains. > Thanks a lot for such detailed explanation! >