On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote: > On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote: > > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote: > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 8f3521be80ca..6591f3f33299 100644 > > > +++ b/mm/memory.c > > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > > * Because we'll need to release the locks before doing cow, > > > * pass this work to upper layer. > > > */ > > > - if (READ_ONCE(src_mm->has_pinned) && wp && > > > - page_maybe_dma_pinned(page)) { > > > + if (wp && page_maybe_dma_pinned(page) && > > > + READ_ONCE(src_mm->has_pinned)) { > > > /* We've got the page already; we're safe */ > > > data->cow_old_page = page; > > > data->cow_oldpte = *src_pte; > > > > > > I can also add some more comment to emphasize this. > > > > It is not just that, but the ptep_set_wrprotect() has to be done > > earlier. > > Now I understand your point, I think.. So I guess it's not only about > has_pinned, but it should be a race between the fast-gup and the fork() code, > even if has_pinned is always set. Yes > > The best algorithm I've thought of is something like: > > > > pte_map_lock() > > if (page) { > > if (wp) { > > ptep_set_wrprotect() > > /* Order with try_grab_compound_head(), either we see > > * page_maybe_dma_pinned(), or they see the wrprotect */ > > get_page(); > > Is this get_page() a must to be after ptep_set_wrprotect() > explicitly? No, just before page_maybe_dma_pinned() > IIUC what we need is to order ptep_set_wrprotect() and > page_maybe_dma_pinned() here. E.g., would a "mb()" work? mb() is not needed because page_maybe_dma_pinned() has an atomic barrier too. I like to see get_page() followed immediately by page_maybe_dma_pinned() since they are accessing the same atomic and could be fused together someday > Another thing is, do we need similar thing for e.g. gup_pte_range(), so that > to guarantee ordering of try_grab_compound_head() and the pte change > check? gup_pte_range() is as I quoted? The gup slow path ends up in follow_page_pte() which uses the pte lock so is OK. > > Another question is, how about read fast-gup for pinning? Because we can't use > the write-protect mechanism to block a read gup. I remember we've discussed > similar things and iirc your point is "pinned pages should always be with > WRITE". However now I still doubt it... Because I feel like read gup is still > legal (as I mentioned previously - when device purely writes to the page and > the processor only reads from it). We need a definition for what FOLL_PIN means. After this work on fork I propose that FOLL_PIN means: The page is in-use for DMA and the CPU PTE should not be changed without explicit involvement of the application (eg via mmap/munmap) If GUP encounters a read-only page during FOLL_PIN the behavior should depend on what the fault handler would do. If the fault handler would trigger COW and replace the PTE then it violates the above. GUP should do the COW before pinning. If the fault handler would SIGSEGV then GUP can keep the read-only page and allow !FOLL_WRITE access. The PTE should not be replaced for other reasons (though I think there is work there too). For COW related issues the idea is the mm_struct doing the pin will never trigger a COW. When other processes hit the COW they copy the page into their mm and don't touch the source MM's PTE. Today we do this roughly with FOLL_FORCE and FOLL_WRITE in the users, but a more nuanced version and documentation would be much clearer. Unfortunately just doing simple read GUP potentially exposes things to various COW related data corruption races. This is a discussion beyond this series though.. Jason