On 01.04.22 21:15, Nadav Amit wrote: > [ +Rick ] > >> On Apr 1, 2022, at 3:13 AM, David Hildenbrand <david@xxxxxxxxxx> wrote: >> >> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we >> can try mapping anonymous pages writable if they are exclusive, >> the PTE is already dirty, and no special handling applies. Mapping the >> PTE writable is essentially the same thing the write fault handler would do >> in this case. > > In general I am all supportive for such a change. > > I do have some mostly-minor concerns. Hi Nadav, thanks a lot for your review! > >> >> +static inline bool can_change_pte_writable(struct vm_area_struct *vma, >> + unsigned long addr, pte_t pte, >> + unsigned long cp_flags) >> +{ >> + struct page *page; >> + >> + if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT)) >> + /* >> + * MM_CP_DIRTY_ACCT is only expressive for shared mappings; >> + * without MM_CP_DIRTY_ACCT, there is nothing to do. >> + */ >> + return false; >> + >> + if (!(vma->vm_flags & VM_WRITE)) >> + return false; >> + >> + if (pte_write(pte) || pte_protnone(pte) || !pte_dirty(pte)) >> + return false; > > If pte_write() is already try then return false? I understand you want > to do so because the page is already writable, but it is confusing. I thought about just doing outside of the function if ((vma->vm_flags & VM_WRITE) && !pte_write(pte) && can_change_pte_writable()... I refrained from doing so because the sequence of checks might be sub-optimal. But most probably we don't really care about that and it might make the code easier to grasp. Would that make it clearer? > > In addition, I am not sure about the pte_dirty() check is really robust. > I mean I think it is ok, but is there any issue with shadow-stack? Judging that it's already used that way for VMAs with dirty tracking, I assume it's ok. Without checking that the PTE is dirty, we'd have to do a: pte_mkwrite(pte_mkwrite(ptent)); Which would set the pte and consequently the page dirty, although there might not even be a write access. That's what we want to avoid here. > > And this also assumes the kernel does not clear the dirty bit without > clearing the present, as otherwise the note in Intel SDM section 4.8 > ("Accessed and Dirty Flags”) will be relevant and dirty bit might be > set unnecessarily. I think it is ok. Yeah, I think so as well. > >> + >> + /* Do we need write faults for softdirty tracking? */ >> + if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !pte_soft_dirty(pte) && >> + (vma->vm_flags & VM_SOFTDIRTY)) > > If !IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) then VM_SOFTDIRTY == 0. So I do not > think the IS_ENABLED() is necessary (unless you think it is clearer this > way). Right, we can just do if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte)) and it should get fully optimized out. Thanks! > >> + return false; >> + >> + /* Do we need write faults for uffd-wp tracking? */ >> + if (userfaultfd_pte_wp(vma, pte)) >> + return false; >> + >> + if (!(vma->vm_flags & VM_SHARED)) { >> + /* >> + * We can only special-case on exclusive anonymous pages, >> + * because we know that our write-fault handler similarly would >> + * map them writable without any additional checks while holding >> + * the PT lock. >> + */ >> + page = vm_normal_page(vma, addr, pte); > > I guess we cannot call vm_normal_page() twice, once for prot_numa and once > here, in practice... I guess we could, but it doesn't necessarily make the code easier to read :) And we want to skip protnone either way. > >> + if (!page || !PageAnon(page) || !PageAnonExclusive(page)) >> + return false; >> + } >> + >> + return true; >> +} > > Note that there is a small downside to all of that. Assume you mprotect() > a single page from RO to RW and you have many threads. > > With my pending patch you would avoid the TLB shootdown (and get a PF). > With this patch you would get a TLB shootdown and save the PF. IOW, I > think it is worthy to skip the shootdown as well in such a case and > instead flush the TLB on spurious page-faults. But I guess that’s for > another patch. Just so I understand correctly: your optimization avoids the flush when effectively, nothing changed (R/O -> R/O). And the optimization for this case here would be, to avoid the TLB flush when similarly not required (R/O -> R/W). Correct? -- Thanks, David / dhildenb