On 19.03.22 00:48, Jason Gunthorpe wrote:
> On Tue, Mar 15, 2022 at 03:18:30PM +0100, David Hildenbrand wrote:
>> This is just the natural follow-up of part 2, that will also further
>> reduce "wrong COW" on the swapin path, for example, when we cannot remove
>> a page from the swapcache due to concurrent writeback, or if we have two
>> threads faulting on the same swapped-out page. Fixing O_DIRECT is just a
>> nice side-product :)

Hi Jason,

thanks for the review!

>
> I know I would benefit alot from a description of the swap specific
> issue a bit more. Most of this message talks about clear_refs which I
> do understand a bit better.

Patch #1 contains some additional information. In general, it's the same
issue as with any other mechanism that could get the page mapped R/O
while there is a FOLL_GET | FOLL_WRITE reference to it -- for example,
DMA to that page as happens with our O_DIRECT reproducer.

Part 2 essentially fixed the other cases (i.e., clear_refs), but the
remaining swapout+refault from swapcache case is handled in this series.

>
> Is this talking about what happens after a page gets swapped back in?
> eg the exclusive bit is missing when the page is recreated?

Right, try_to_unmap() was the last remaining case where we'd have lost
the exclusivity information -- it wasn't required for reliable GUP pins
in part 2.

Here is what happens without PG_anon_exclusive:

1. The application uses parts of an anonymous base page for direct I/O,
   let's assume the first 512 bytes of the page.

     fd = open(filename, O_DIRECT| ...);
     pread(fd, page, 512, 0);

   O_DIRECT will take a FOLL_GET|FOLL_WRITE reference on the page.

2. Reclaim kicks in and wants to swap out the page -- mm/vmscan.c

   shrink_page_list() first adds the page to the swapcache and then
   unmaps it via try_to_unmap().

   After the page was successfully unmapped, pageout() will start
   triggering writeback but will realize that there are additional
   references on the page (via is_page_cache_freeable()) and fail.

3. The application uses unrelated parts of the page for other purposes
   while the DMA is not completed, e.g., doing a simple

     page[4095]++;

   The read access will fault in the page readable from the swap cache
   in do_swap_page(). The write access will trigger our COW fault
   handler. As we have an additional reference on the page, we will
   create a copy and map it into our page table. At this point, the
   page table and the GUP reference are out of sync.

4. O_DIRECT completes

   The read targets the page that is no longer referenced in the page
   tables. For the application, it looks like the read() never happened,
   as we lost our DMA read to our page.

With PG_anon_exclusive from series part 2, we don't remember exclusivity
information in try_to_unmap() yet. do_swap_page() cannot restore it as
it has to assume the page is possibly shared.

With this series, we remember exclusivity information in try_to_unmap()
in the SWP PTE. do_swap_page() can restore it. Consequently, our COW
fault handler won't create a wrong copy and we won't go out of sync
between GUP and the page mapped into the page table.

Hope that helps!

-- 
Thanks,

David / dhildenb
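
P.S. In case it helps to see the four steps as state transitions, here
is a rough userspace toy model. It is not kernel code: every toy_* name
is made up for illustration, the refcount only counts "one page table
mapping plus one O_DIRECT reference" (the swapcache reference is left
out), and the copy decision is heavily simplified compared to the real
COW fault handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_page {
    int refcount;          /* page table mapping + GUP references */
    bool anon_exclusive;   /* stands in for PG_anon_exclusive */
};

struct toy_swp_pte {
    bool exclusive;        /* exclusivity remembered in the swap PTE */
};

/* Step 2: unmap for swapout. remember_exclusive == true models this series. */
static struct toy_swp_pte toy_try_to_unmap(struct toy_page *page,
                                           bool remember_exclusive)
{
    struct toy_swp_pte pte = {
        .exclusive = remember_exclusive && page->anon_exclusive,
    };

    page->anon_exclusive = false;
    page->refcount--;      /* the page table mapping goes away */
    return pte;
}

/* Step 3a: refault from the swapcache, restoring whatever the PTE kept. */
static void toy_do_swap_page(struct toy_page *page, struct toy_swp_pte pte)
{
    page->refcount++;      /* mapped into the page table again */
    page->anon_exclusive = pte.exclusive;
}

/* Step 3b: write fault. Returns true if the page would get copied. */
static bool toy_write_fault_copies(const struct toy_page *page)
{
    if (page->anon_exclusive)
        return false;      /* known exclusive: reuse the page */
    return page->refcount > 1; /* extra references: assume shared, copy */
}

/* Runs steps 1-3; a "true" result means GUP and page table go out of sync. */
bool toy_scenario(bool remember_exclusive)
{
    struct toy_page page = { .refcount = 1, .anon_exclusive = true };
    struct toy_swp_pte pte;

    page.refcount++;       /* step 1: O_DIRECT takes its reference */
    pte = toy_try_to_unmap(&page, remember_exclusive);
    assert(page.refcount == 1); /* only the O_DIRECT reference is left */
    toy_do_swap_page(&page, pte);
    return toy_write_fault_copies(&page);
}
```

In this model, toy_scenario(false) reports a copy (the "wrong COW" from
step 3), while toy_scenario(true) reuses the page, so the DMA in step 4
still lands in the page the application sees.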