On 24.02.22 13:26, David Hildenbrand wrote: > This series is the result of the discussion on the previous approach [2]. > More information on the general COW issues can be found there. It is based > on [1], which resides in -mm and -next: > [PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for > THP and swap > > I keep the latest state, including some hacky selftest on: > https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2 > > This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken > on an anonymous page and COW logic fails to detect exclusivity of the page > to then replacing the anonymous page by a copy in the page table: The > GUP pin lost synchronicity with the pages mapped into the page tables. > > This issue, including other related COW issues, has been summarized in [3] > under 3): > " > 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) > > page_maybe_dma_pinned() is used to check if a page may be pinned for > DMA (using FOLL_PIN instead of FOLL_GET). While false positives are > tolerable, false negatives are problematic: pages that are pinned for > DMA must not be added to the swapcache. If it happens, the (now pinned) > page could be faulted back from the swapcache into page tables > read-only. Future write-access would detect the pinning and COW the > page, losing synchronicity. For the interested reader, this is nicely > documented in feb889fb40fa ("mm: don't put pinned pages into the swap > cache"). > > Peter reports [8] that page_maybe_dma_pinned() as used is racy in some > cases and can result in a violation of the documented semantics: > giving false negatives because of the race. > > There are cases where we call it without properly taking a per-process > sequence lock, turning the usage of page_maybe_dma_pinned() racy. While > one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to > handle, there is especially one rmap case (shrink_page_list) that's hard > to fix: in the rmap world, we're not limited to a single process. > > The shrink_page_list() issue is really subtle. If we race with > someone pinning a page, we can trigger the same issue as in the FOLL_GET > case. See the detail section at the end of this mail on a discussion how > bad this can bite us with VFIO or other FOLL_PIN user. > > It's harder to reproduce, but I managed to modify the O_DIRECT > reproducer to use io_uring fixed buffers [15] instead, which ends up > using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can > similarly trigger a loss of synchronicity and consequently a memory > corruption. > > Again, the root issue is that a write-fault on a page that has > additional references results in a COW and thereby a loss of > synchronicity and consequently a memory corruption if two parties > believe they are referencing the same page. > " > > This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, > especially also taking care of concurrent pinning via GUP-fast, > for example, also fully fixing an issue reported regarding NUMA > balancing [4] recently. While doing that, it further reduces "unnecessary > COWs", especially when we don't fork()/KSM and don't swapout, and fixes the > COW security for hugetlb for FOLL_PIN. > > In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped > anonymous page is exclusive. Exclusive anonymous pages that are mapped > R/O can directly be mapped R/W by the COW logic in the write fault handler. > Exclusive anonymous pages that want to be shared (fork(), KSM) first have > to mark a mapped anonymous page shared -- which will fail if there are > GUP pins on the page. GUP is only allowed to take a pin on anonymous pages > that is exclusive. The PT lock is the primary mechanism to synchronize > modifications of PG_anon_exclusive. GUP-fast is synchronized either via the > src_mm->write_protect_seq or via clear/invalidate+flush of the relevant > page table entry. > > Special care has to be taken about swap, migration, and THPs (whereby a > PMD-mapping can be converted to a PTE mapping and we have to track > information for subpages). Besides these, we let the rmap code handle most > magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE > logic as part of our previous approach [2], however, it's now 100% mapcount > free and I further simplified it a bit. > > #1 is a fix > #3-#7 are mostly rmap preparations for PG_anon_exclusive handling > #8 introduces PG_anon_exclusive > #9 uses PG_anon_exclusive and make R/W pins of anonymous pages > reliable > #10 is a preparation for reliable R/O pins > #11 and #12 is reused/modified GUP-triggered unsharing for R/O GUP pins > make R/O pins of anonymous pages reliable > #13 adds sanity check when (un)pinning anonymous pages > > I'm not proud about patch #8, suggestions welcome. Patch #9 contains > excessive explanations and the main logic for R/W pins. #11 and #12 > resemble what we proposed in the previous approach [2]. I consider the > general approach of #13 very nice and helpful, and I remember Linus even > envisioning something like that for finding BUGs, although we might want to > implement the sanity checks eventually differently > > It passes my growing set of tests for "wrong COW" and "missed COW", > including the ones in [30 -- I'd really appreciate some experienced eyes > to take a close look at corner cases. Only tested on x86_64, testing with > CONT-mapped hugetlb pages on arm64 might be interesting. > > Once we converted relevant users of FOLL_GET (e.g., O_DIRECT) to FOLL_PIN, > the issue described in [3] under 2) will be fixed as well. Further, once > that's in place we can streamline our COW logic for hugetlb to rely on > page_count() as well and fix any possible COW security issues. Hi, I did excessive tests on aarch64 with CONT hugetlb pages yesterday and didn't find any surprises, at least not in these changes. [1] While thinking on how to get O_DIRECT fixed immediately, I realized the following: (1) This series turns FOLL_PIN on anonymous pages completely reliable, for any case of GUP+fork (GUP before fork, GUP during fork, GUP after fork in child/parent), which is pretty nice IMHO. (2) For O_DIRECT and friends (FOLL_GET) we primarily care about fixing short-term FOLL_WRITE *without* fork(), which are the memory corruptions we're experiencing. Long-term FOLL_GET (meaning, even staying reliable after fork()) already has a bad smell to it. Especially, (2) never worked reliably with fork() involved. I came to the conclusion that this series pretty much fixes (2) already *except* the swapout case. I might be wrong, but I think it should be possible to handle that as well, meaning that: a FOLL_GET|FOLL_WRITE on an anonymous page will be reliable as long as we don't fork(). I'm planning on sending a part3 to cover that, so we don't have to wait for the FOLL_PIN conversion to get our O_DIRECT reproducers fixed. Stay tuned. If there isn't any more feedback, I'll be sending out a v1 soonish. [1] https://lkml.kernel.org/r/811c5c8e-b3a2-85d2-049c-717f17c3a03a@xxxxxxxxxx -- Thanks, David / dhildenb