On 26/03/2024 17:38, David Hildenbrand wrote: > On 26.03.24 18:27, Ryan Roberts wrote: >> On 26/03/2024 17:02, David Hildenbrand wrote: >>> On 15.02.24 13:17, Ryan Roberts wrote: >>>> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to >>>> ptep_get_lockless_norecency() to save orig_pte. >>>> >>>> There are a number of places that follow this model: >>>> >>>> orig_pte = ptep_get_lockless(ptep) >>>> ... >>>> <lock> >>>> if (!pte_same(orig_pte, ptep_get(ptep))) >>>> // RACE! >>>> ... >>>> <unlock> >>>> >>>> So we need to be careful to convert all of those to use >>>> pte_same_norecency() so that the access and dirty bits are excluded from >>>> the comparison. >>>> >>>> Additionally there are a couple of places that genuinely rely on the >>>> access and dirty bits of orig_pte, but with some careful refactoring, we >>>> can use ptep_get() once we are holding the lock to achieve equivalent >>>> logic. >>> >>> We really should document that changed behavior somewhere where it can be easily >>> found: that orig_pte might have incomplete/stale accessed/dirty information. >> >> I could add it to the orig_pte definition in the `struct vm_fault`? >> >>> >>> >>>> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) >>>> vmf->address, &vmf->ptl); >>>> if (unlikely(!vmf->pte)) >>>> return 0; >>>> - vmf->orig_pte = ptep_get_lockless(vmf->pte); >>>> + vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte); >>>> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; >>>> >>>> if (pte_none(vmf->orig_pte)) { >>>> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) >>>> >>>> spin_lock(vmf->ptl); >>>> entry = vmf->orig_pte; >>>> - if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >>>> + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) { >>>> update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); >>>> goto unlock; >>> >>> I was wondering about the following: >>> >>> Assume the PTE is not dirty. >>> >>> Thread 1 does >> >> Sorry not sure what threads have to do with this? How is the vmf shared between >> threads? What have I misunderstood... > > Assume we have a HW that does not have HW-managed access/dirty bits. One that > ends up using generic ptep_set_access_flags(). Access/dirty bits are always > updated under PT lock. > > Then, imagine two threads. One is the the fault path here. another thread > performs some other magic that sets the PTE dirty under PTL. > >> >>> >>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte) >>> /* not dirty */ >>> >>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */ Ahh, this comment about thread 2 is not referring to the code immediately below it. It all makes much more sense now. :) >>> >>> spin_lock(vmf->ptl); >>> entry = vmf->orig_pte; >>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >>> ... >>> } >>> ... >>> entry = pte_mkyoung(entry); >> >> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else. > > No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and > unconditionally does that in handle_pte_fault(). > >> >>> if (ptep_set_access_flags(vmf->vma, ...) >>> ... >>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>> >>> >>> Generic ptep_set_access_flags() will do another pte_same() check and realize >>> "hey, there was a change!" let's update the PTE! >>> >>> set_pte_at(vma->vm_mm, address, ptep, entry); >> >> This is called from the generic ptep_set_access_flags() in your example, right? >> > > Yes. > >>> >>> would overwrite the dirty bit set by thread 2. >> >> I'm not really sure what you are getting at... Is your concern that there is a >> race where the page could become dirty in the meantime and it now gets lost? I >> think that's why arm64 overrides ptep_set_access_flags(); since the hw can >> update access/dirty we have to deal with the races. > > My concern is that your patch can in subtle ways lead to use losing PTE dirty > bits on architectures that don't have the HW-managed dirty bit. They do exist ;) But I think the example you give can already happen today? Thread 1 reads orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to set dirty just after the get, then thread 1 is going to set the PTE back to (a modified version of) orig_pte. Isn't it already broken? > > Arm64 should be fine in that regard. > There is plenty of arm64 HW that doesn't do HW access/dirty update. But our ptep_set_access_flags() can always deal with a racing update, even if that update originates from SW. Why do I have the feeling you're about to explain (very patiently) exactly why I'm wrong?... :)