On 14.01.22 06:00, zhangliang (AG) wrote:
> 
> 
> On 2022/1/14 0:48, David Hildenbrand wrote:
>> On 13.01.22 17:37, Linus Torvalds wrote:
>>> On Thu, Jan 13, 2022 at 6:39 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>>>
>>>> Let's bring Linus in on this, but I think this reintroduces all of the
>>>> mapcount problems that we've been discussing recently.
>>>>
>>>> How about this as an alternative?
>>>
>>> No, at that point reuse_swap_page() is the better thing to do.
>>>
>>> Don't play games with page_count() (or even worse games with
>>> swap_count). The page count is only stable if it's 1. Any other value
>>> means that it can fluctuate due to concurrent lookups, some of which
>>> can be done locklessly under RCU.
>>
>> I'm pretty sure the patch as is will reintroduce the CVE. So I think in
> 
> Actually, I wonder how reuse_swap_page() in this patch can reintroduce
> the CVE; I think the invoking logic here is the same as that in
> do_swap_page(). So, could you give me some hint about this? Thanks :)

The CVE essentially is

a) Process A maps an anon page
   -> refcount = 1, mapcount = 1
b) Process A forks process B
   -> Page mapped R/O
   -> refcount = 2, mapcount = 2
c) Process B pins the page R/O (e.g., vmsplice)
   -> refcount > 2, mapcount = 2
d) Process B unmaps the page
   -> refcount > 1, mapcount = 1
e) Process A writes to the page

If e) ends up reusing the page for process A (e.g., because mapcount == 1),
process B can observe the memory modifications via the page pinning.

As I previously said, our COW logic is completely inconsistent and should
be streamlined.

Regarding the CVE, what would have to happen is that we put the page into
the swapcache and unmap it from both processes after c). Then,
do_swap_page() would similarly reuse the page during e). We do have a
check in place that prevents pinned pages from entering the swapcache, but
IIRC it can race with GUP, so it might race with c) eventually, meaning
that with swap involved there might still be a small chance for the CVE in
current upstream IIRC.

I think with your patch we could trick do_wp_page() into going down the
reuse_swap_page() path and reusing the page, even though there is a
pending pin on the page from process B:

a) Process A maps an anon page
   -> refcount = 1, mapcount = 1
b) Page is added to the swapcache and unmapped
   -> refcount = 1, mapcount = 0
c) Process A reads the page
   -> do_swap_page() will map the page R/O
   -> the page might remain in the swapcache
   -> refcount = 2, mapcount = 1
d) Process A forks process B
   -> Page mapped R/O
   -> refcount = 3, mapcount = 2
e) Process B pins the page R/O (e.g., vmsplice)
   -> refcount > 3, mapcount = 2
f) Process B unmaps the page
   -> refcount > 2, mapcount = 1
g) Process A writes to the page

In g), we would go into reuse_swap_page() and could succeed, being left
with refcount > 1 and mapcount = 1. But maybe I am missing something :)

There are only two real ways I know to handle the scenario above:

(1) Rely on page_count() == 1 in the COW logic to decide whether we can
    reuse a page. This comes with false negatives, some being harmful,
    some just being a performance problem.

(2) Rely on mapcount(), swapcount() ... when deciding to reuse a page and
    when it's safe to take a GUP pin.

(2) was discussed in
https://lkml.kernel.org/r/20211217113049.23850-1-david@xxxxxxxxxx
but was essentially NAKed by Linus in its current form (well, it didn't
consider the swapcount, so it was just buggy; fortunately Linus pointed
that out).
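
FWIW, the a)-e) sequence above can be exercised from user space roughly as
below. This is only a minimal, hypothetical sketch, not one of the
existing reproducers or the selftest mentioned further down; error
handling and the sleep-based ordering are simplified, and all names are
illustrative:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>

int main(void)
{
	size_t pagesize = sysconf(_SC_PAGESIZE);
	int pipefd[2];
	char *mem;
	pid_t pid;

	mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED || pipe(pipefd))
		exit(1);
	memset(mem, 'a', pagesize);	/* a) anon page mapped and populated */

	pid = fork();			/* b) page now mapped R/O in A and B */
	if (pid == 0) {
		struct iovec iov = {
			.iov_base = mem,
			.iov_len = pagesize,
		};
		char c;

		/* c) B takes a R/O pin on the page via the pipe */
		if (vmsplice(pipefd[1], &iov, 1, 0) != (ssize_t)pagesize)
			exit(1);
		munmap(mem, pagesize);	/* d) B unmaps; only the pin remains */
		sleep(1);		/* give A time to write */
		if (read(pipefd[0], &c, 1) != 1)
			exit(1);
		/* 'b' here means B observed A's post-fork modification */
		printf("child sees: %c\n", c);
		exit(c == 'b');
	}
	usleep(100 * 1000);		/* crude ordering: let B pin + unmap first */
	mem[0] = 'b';			/* e) A writes -> COW reuse decision */
	waitpid(pid, NULL, 0);
	return 0;
}

If the write fault in e) wrongly reuses the page while the pipe still
holds the pin, the child reads the parent's modification through the pipe.
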
To handle the problematic false negatives with (1), I'm looking into
marking anon pages as exclusively owned by a single process, such that we
can simply reuse them in the COW path -- PageAnonExclusive() is also
discussed in that posting.

That posting also contains a test case which tests all the different
variants of the "child can observe MAP_PRIVATE modifications of the
parent" issue we know of (e.g., THP, hugetlb), which you might use to
verify whether the issue is reintroduced by your patch or not.

Yesterday, I played with removing reuse_swap_page() completely and
streamlining our COW logic like so (based on
https://www.spinics.net/lists/linux-mm/msg279460.html):

commit f1fa31a40b13ade5cd93ceb40daadf1196526145
Author: David Hildenbrand <david@xxxxxxxxxx>
Date:   Fri Jan 14 09:29:52 2022 +0100

    mm: optimize do_wp_page() for exclusive pages in the swapcache
    
    Let's optimize for a page with a single user that has been added to the
    swapcache. Try removing the swapcache reference if there is hope that
    we're the exclusive user, but keep the page_count(page) == 1 check in
    place.
    
    Avoid using reuse_swap_page(); we'll streamline all reuse_swap_page()
    users next.
    
    While at it, remove the superfluous page_mapcount() check: it's
    implicitly covered by the page_count() for ordinary anon pages.
    
    Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

diff --git a/mm/memory.c b/mm/memory.c
index e8e2144cbfa6..bd2af7a36791 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3290,12 +3290,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	if (PageAnon(vmf->page)) {
 		struct page *page = vmf->page;
 
-		/* PageKsm() doesn't necessarily raise the page refcount */
-		if (PageKsm(page) || page_count(page) != 1)
+		/*
+		 * PageKsm() doesn't necessarily raise the page refcount.
+		 *
+		 * These checks are racy as long as we haven't locked the page;
+		 * they are a pure optimization to avoid trying to lock the page
+		 * and trying to free the swap cache when there is little hope
+		 * it will actually result in a refcount of 1.
+		 */
+		if (PageKsm(page) || page_count(page) <= 1 + PageSwapCache(page))
 			goto copy;
 		if (!trylock_page(page))
 			goto copy;
-		if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+		if (PageKsm(page) || page_count(page) != 1) {
 			unlock_page(page);
 			goto copy;
 		}

Followed by

commit 754aa8f04f7b6d1ffc2b9844be63bbf7b527929f (HEAD)
Author: David Hildenbrand <david@xxxxxxxxxx>
Date:   Fri Jan 14 09:53:20 2022 +0100

    mm: streamline COW logic in do_swap_page()
    
    Let's apply the same COW logic as in do_wp_page(), conditionally trying
    to remove the page from the swapcache after freeing the swap entry,
    however, before actually mapping our page. We can change the order now
    that reuse_swap_page() is no longer used.
    
    Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

diff --git a/mm/memory.c b/mm/memory.c
index bd2af7a36791..a1bd2b5c818a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3489,6 +3489,24 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline bool should_try_to_free_swap(struct page *page,
+					   struct vm_area_struct *vma,
+					   unsigned int fault_flags)
+{
+	if (!PageSwapCache(page))
+		return false;
+	if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
+	    PageMlocked(page))
+		return true;
+	/*
+	 * We must only reuse the page if the page_count() is 1. Try to
+	 * free the swapcache to get rid of the swapcache reference in case
+	 * it's likely that we will succeed.
+	 */
+	return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+	       page_count(page) == 2;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3612,7 +3630,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/*
-	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
+	 * Make sure try_to_free_swap or swapoff did not
 	 * release the swapcache from under us. The page pin, and pte_same
 	 * test below, are not enough to exclude that. Even if it is still
 	 * swapcache, we need to check that the page's swap has not changed.
@@ -3644,19 +3662,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/*
-	 * The page isn't present yet, go ahead with the fault.
-	 *
-	 * Be careful about the sequence of operations here.
-	 * To get its accounting right, reuse_swap_page() must be called
-	 * while the page is counted on swap but not yet in mapcount i.e.
-	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
-	 * must be called after the swap_free(), or it will never succeed.
+	 * Remove the swap entry and conditionally try to free up the swapcache.
+	 * We're already holding a reference on the page but haven't mapped it
+	 * yet. We won't be waiting on writeback, so if there is concurrent
+	 * writeback we won't map the page writable just now.
 	 */
 
+	swap_free(entry);
+	if (should_try_to_free_swap(page, vma, vmf->flags))
+		try_to_free_swap(page);
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+
+	/* Same logic as in do_wp_page(). */
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+	    !PageKsm(page) && page_count(page) == 1) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
@@ -3681,10 +3702,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
 	}
 
-	swap_free(entry);
-	if (mem_cgroup_swap_full(page) ||
-	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
-		try_to_free_swap(page);
 	unlock_page(page);
 	if (page != swapcache && swapcache) {
 		/*

And followed by more patches that rip out reuse_swap_page() completely.

But I still have to understand what's safe to do and not do in
do_swap_page(), and what effect it would have. Especially PTE-mapped THP
are a head-scratcher for me, as is the interaction with THP in the
swapcache before/after actual swap. So consider the above a WIP prototype
that's mostly untested.

-- 
Thanks,

David / dhildenb