On Tue, Sep 15, 2020 at 03:29:33PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 01:05:53PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> > > On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > > > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > > > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > > > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > > > doesn't establish a COW that would break the DMA by moving the
> > > > physical page over to the fork. DMA should stay with the process that
> > > > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > > > recent years work to GUP/etc? It is a pretty terrible ancient thing)
> > >
> > > ... Now I'm more confused on what has happened.
> >
> > I'm going to try to confirm that the MADV_DONTFORK is actually being
> > done by userspace properly, more later.
>
> It turns out the test is broken and does not call MADV_DONTFORK when
> doing forks - it is an opt-in it didn't do.
>
> It looks to me like this patch makes it much more likely that the COW
> break after page pinning will end up moving the pinned physical page
> to the fork while before it was not very common. Does that make sense?

My understanding is that the fix should not matter much for the currently
failing test case, as long as it pins with FOLL_FORCE & FOLL_WRITE.

What I'm not sure about, though, is the case where the RDMA/DMA buffers are
meant for pure reads from userspace.  E.g. for vfio I'm looking at
vaddr_get_pfn(), where I believe such read-only buffers are pinned with
FOLL_PIN and !FOLL_WRITE, which finally gets passed down to
pin_user_pages_remote().

So what I'm worried about is something like this:

  1. Proc A gets a private anon page X for DMA, mapcount==refcount==1.

  2. Proc A fork()s and gives birth to proc B; page X now has
     mapcount==refcount==2 and is write-protected.  Proc B quits, and page X
     goes back to mapcount==refcount==1 (note: without the WRITE bit set in
     the PTE).

  3. pin_user_pages(write=false) on page X.  Since it's !FORCE & !WRITE, no
     COW is needed.  Refcount==2 after that.

  4. Pass the page to the device.  Whether we set up an IOMMU page table or
     just use the PFN is not important imho - what matters is that the
     device will DMA into page X no matter what.

  5. Some thread of proc A writes to page X and triggers COW, since the page
     is write-protected with mapcount==1 && refcount==2.  The HVA that
     pointed to page X is changed to point to another page Y after the COW.

  6. The device DMA happens and the data lands in X.  Proc A can never see
     that data, though, because it's now looking at page Y.

If this is a problem, we may still need the fix patch (though maybe not as
urgently as before).  But I'd like to double-check, in case I'm missing some
obvious fact above.

> Given that the tests are wrong it seems like broken userspace,
> however, it also worked reliably for a fairly long time.

IMHO it worked because the page used for RDMA had mapcount==1, so previously
it was simply reused as-is even after a fork without MADV_DONTFORK, once the
child quit.  Logically, though, it should really be protected by
MADV_DONTFORK rather than relying on that reuse.

Thanks,

-- 
Peter Xu
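
A minimal userspace sketch of the MADV_DONTFORK pattern the thread describes
(not taken from the thread itself; register_buffer_for_dma() is a
hypothetical placeholder for whatever driver call - e.g. ibv_reg_mr() for
RDMA or a vfio DMA-map ioctl - ends up doing pin_user_pages_fast() on the
range):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;

        /* Private anonymous buffer that a device would later DMA into. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /*
         * Opt this range out of fork(): a child never inherits the mapping,
         * so no COW write-protection is ever established on the page the
         * driver will pin.  This is the call the broken test skipped.
         */
        if (madvise(buf, len, MADV_DONTFORK)) {
                perror("madvise(MADV_DONTFORK)");
                return 1;
        }

        /*
         * register_buffer_for_dma(buf, len) would go here - the (assumed)
         * driver entry point that pins the range with pin_user_pages_fast()
         * using FOLL_FORCE | FOLL_WRITE.
         */
        memset(buf, 0, len);

        /* A fork() after this point cannot COW the pinned page away. */
        printf("buf %p ready for DMA registration\n", buf);

        munmap(buf, len);
        return 0;
}

Because the child never maps the range at all, the fork in step 2 of the
scenario above never write-protects the pinned page in the first place.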