Hi, thank you very much for your patient reply. I completely agree with
your and Linus' viewpoint about COW and reuse_swap_page(). :)

On 2022/1/14 19:23, David Hildenbrand wrote:
> On 14.01.22 06:00, zhangliang (AG) wrote:
>>
>>
>> On 2022/1/14 0:48, David Hildenbrand wrote:
>>> On 13.01.22 17:37, Linus Torvalds wrote:
>>>> On Thu, Jan 13, 2022 at 6:39 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Let's bring Linus in on this, but I think this reintroduces all of the
>>>>> mapcount problems that we've been discussing recently.
>>>>>
>>>>> How about this as an alternative?
>>>>
>>>> No, at that point reuse_swap_page() is the better thing to do.
>>>>
>>>> Don't play games with page_count() (or even worse games with
>>>> swap_count). The page count is only stable if it's 1. Any other value
>>>> means that it can fluctuate due to concurrent lookups, some of which
>>>> can be done locklessly under RCU.
>>>
>>> I'm pretty sure the patch as is will reintroduce the CVE. So I think in
>>
>> Actually, I wonder how reuse_swap_page() in this patch can reintroduce
>> the CVE; I think the invoking logic here is the same as that in
>> do_swap_page(). So, could you give me some hint about this? Thanks :)
>
> The CVE essentially is:
>
> a) Process A maps an anon page
>    -> refcount = 1, mapcount = 1
> b) Process A forks process B
>    -> Page mapped R/O
>    -> refcount = 2, mapcount = 2
> c) Process B pins the page R/O (e.g., vmsplice)
>    -> refcount > 2, mapcount = 2
> d) Process B unmaps the page
>    -> refcount > 1, mapcount = 1
> e) Process A writes to the page
>
> If e) ends up reusing the page for process A (e.g., because mapcount == 1),
> process B can observe the memory modifications via the page pin.
>
> As I previously said, our COW logic is completely inconsistent and should
> be streamlined. Regarding the CVE, what would have to happen is that we
> put the page into the swapcache and unmap it from both processes after c).
> Then, do_swap_page() would similarly reuse the page during e).
>
> We do have a check in place that prevents pinned pages from entering the
> swapcache, but IIRC it can race with GUP, so it might race with c)
> eventually, meaning that with swap there might still be a small chance
> for the CVE in current upstream.
>
> I think with your patch we could trick do_wp_page() into going down the
> reuse_swap_page() path and reusing the page, even though there is a
> pending pin on the page from process B:
>
> a) Process A maps an anon page
>    -> refcount = 1, mapcount = 1
> b) Page is added to the swapcache and unmapped
>    -> refcount = 1, mapcount = 0
> c) Process A reads the page
>    -> do_swap_page() will map the page R/O
>    -> the page might remain in the swapcache
>    -> refcount = 2, mapcount = 1
> d) Process A forks process B
>    -> Page mapped R/O
>    -> refcount = 3, mapcount = 2
> e) Process B pins the page R/O (e.g., vmsplice)
>    -> refcount > 3, mapcount = 2
> f) Process B unmaps the page
>    -> refcount > 2, mapcount = 1
> g) Process A writes to the page
>
> In g), we would go into reuse_swap_page() and could succeed, being left
> with refcount > 1 and mapcount = 1.
>
> But maybe I am missing something :)
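
That makes the problem much clearer, thanks. To convince myself, I wrote
the first sequence a) through e) down as a minimal userspace sketch (a
simplified illustration only, with a hardcoded 4 KiB page size and no
error handling, not the full test case from your posting):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* a) Process A maps an anon page and populates it. */
	char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int pipefd[2];
	char buf;

	memset(page, 'A', 4096);
	pipe(pipefd);

	/* b) Process A forks process B; the page is now mapped R/O. */
	if (fork() == 0) {
		struct iovec iov = { .iov_base = page, .iov_len = 4096 };

		/* c) Process B takes a R/O pin on the page via vmsplice(). */
		vmsplice(pipefd[1], &iov, 1, 0);
		/* d) Process B unmaps the page; the pipe keeps the pin. */
		munmap(page, 4096);
		_exit(0);
	}
	wait(NULL);

	/* e) Process A writes to the page, triggering the COW fault. */
	memset(page, 'B', 4096);

	/* If the page was wrongly reused, the pinned page now shows 'B'. */
	read(pipefd[0], &buf, 1);
	printf("pinned page sees: %c ('B' would be the leak)\n", buf);
	return 0;
}

With a page_count()-based reuse check, the pending pipe reference keeps
page_count() above 1, so e) copies and the read keeps seeing 'A'.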
>
> There are only two real ways I know of to handle the scenario above:
>
> (1) Rely on page_count() == 1 in the COW logic to decide whether we can
>     reuse a page. This comes with false negatives, some being harmful,
>     some just being a performance problem.
> (2) Rely on mapcount(), swapcount() ... when deciding to reuse a page
>     and when it's safe to take a GUP pin.
>
> (2) was discussed in
> https://lkml.kernel.org/r/20211217113049.23850-1-david@xxxxxxxxxx
> but was essentially NAKed by Linus in that form (well, it didn't consider
> the swapcount, so it was just buggy; fortunately Linus pointed that out).
>
> To handle the problematic false negatives of (1), I'm looking into marking
> anon pages as exclusively owned by a single process, such that we can
> simply reuse them in the COW path -- PageAnonExclusive(), also discussed
> inside that posting. That posting also contains a test case covering all
> the different variants of "child can observe MAP_PRIVATE modifications of
> the parent" we know of (e.g., THP, hugetlb), which you might use to verify
> whether your patch reintroduces the issue.
>
> Yesterday, I played with removing reuse_swap_page() completely and
> streamlining our COW logic like so
> (based on https://www.spinics.net/lists/linux-mm/msg279460.html):
>
>
> commit f1fa31a40b13ade5cd93ceb40daadf1196526145
> Author: David Hildenbrand <david@xxxxxxxxxx>
> Date:   Fri Jan 14 09:29:52 2022 +0100
>
>     mm: optimize do_wp_page() for exclusive pages in the swapcache
>
>     Let's optimize for a page with a single user that has been added to
>     the swapcache. Try removing the swapcache reference if there is hope
>     that we're the exclusive user, but keep the page_count(page) == 1
>     check in place.
>
>     Avoid using reuse_swap_page(); we'll streamline all reuse_swap_page()
>     users next.
>
>     While at it, remove the superfluous page_mapcount() check: it's
>     implicitly covered by the page_count() check for ordinary anon pages.
>
>     Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e8e2144cbfa6..bd2af7a36791 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3290,12 +3290,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  	if (PageAnon(vmf->page)) {
>  		struct page *page = vmf->page;
>  
> -		/* PageKsm() doesn't necessarily raise the page refcount */
> -		if (PageKsm(page) || page_count(page) != 1)
> +		/*
> +		 * PageKsm() doesn't necessarily raise the page refcount.
> +		 *
> +		 * These checks are racy as long as we haven't locked the
> +		 * page; they are a pure optimization to avoid trying to lock
> +		 * the page and trying to free the swap cache when there is
> +		 * little hope it will actually result in a refcount of 1.
> +		 */
> +		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>  			goto copy;
>  		if (!trylock_page(page))
>  			goto copy;
> -		if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> +		if (PageSwapCache(page))
> +			try_to_free_swap(page);
> +		if (PageKsm(page) || page_count(page) != 1) {
>  			unlock_page(page);
>  			goto copy;
>  		}
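
If it helps the discussion, here is how I read the new reuse decision,
pulled out of the hunk above into a standalone predicate (a sketch in the
context of mm/memory.c only; wp_page_can_reuse() is a name I made up, it
is not in your patch):

/*
 * Sketch only: restates the reuse decision from the hunk above.
 * Returns true if the anon page is exclusively ours and can be reused.
 */
static bool wp_page_can_reuse(struct page *page)
{
	/*
	 * Racy pre-check without the page lock: the swapcache may hold
	 * one extra reference, so tolerate page_count() == 2 for a
	 * PageSwapCache() page.
	 */
	if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
		return false;
	if (!trylock_page(page))
		return false;
	/* Drop the swapcache reference so an exclusive page can reach 1. */
	if (PageSwapCache(page))
		try_to_free_swap(page);
	/* Stable now that the page is locked: reuse only if count is 1. */
	if (PageKsm(page) || page_count(page) != 1) {
		unlock_page(page);
		return false;
	}
	unlock_page(page);
	return true;
}

If I read it right, tolerating the extra swapcache reference in the first
check is exactly what lets an exclusive-but-swapcached page take the
reuse path at all.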
>
>
> Followed by:
>
>
> commit 754aa8f04f7b6d1ffc2b9844be63bbf7b527929f (HEAD)
> Author: David Hildenbrand <david@xxxxxxxxxx>
> Date:   Fri Jan 14 09:53:20 2022 +0100
>
>     mm: streamline COW logic in do_swap_page()
>
>     Let's apply the same COW logic as in do_wp_page(), conditionally
>     trying to remove the page from the swapcache after freeing the swap
>     entry, but before actually mapping our page. We can change the order
>     now that reuse_swap_page() is no longer used.
>
>     Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index bd2af7a36791..a1bd2b5c818a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3489,6 +3489,24 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +static inline bool should_try_to_free_swap(struct page *page,
> +					   struct vm_area_struct *vma,
> +					   unsigned int fault_flags)
> +{
> +	if (!PageSwapCache(page))
> +		return false;
> +	if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
> +	    PageMlocked(page))
> +		return true;
> +	/*
> +	 * We must only reuse the page if the page_count() is 1. Try to
> +	 * free the swapcache to get rid of the swapcache reference in case
> +	 * it's likely that we will succeed.
> +	 */
> +	return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
> +	       page_count(page) == 2;
> +}
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3612,7 +3630,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	/*
> -	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
> +	 * Make sure try_to_free_swap or swapoff did not
>  	 * release the swapcache from under us. The page pin, and pte_same
>  	 * test below, are not enough to exclude that. Even if it is still
>  	 * swapcache, we need to check that the page's swap has not changed.
> @@ -3644,19 +3662,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	/*
> -	 * The page isn't present yet, go ahead with the fault.
> -	 *
> -	 * Be careful about the sequence of operations here.
> -	 * To get its accounting right, reuse_swap_page() must be called
> -	 * while the page is counted on swap but not yet in mapcount i.e.
> -	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
> -	 * must be called after the swap_free(), or it will never succeed.
> +	 * Remove the swap entry and conditionally try to free up the
> +	 * swapcache. We're already holding a reference on the page but
> +	 * haven't mapped it yet. We won't be waiting on writeback, so if
> +	 * there is concurrent writeback we won't map the page writable
> +	 * just now.
>  	 */
> +	swap_free(entry);
> +	if (should_try_to_free_swap(page, vma, vmf->flags))
> +		try_to_free_swap(page);
>  
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
> -	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> +
> +	/* Same logic as in do_wp_page(). */
> +	if ((vmf->flags & FAULT_FLAG_WRITE) &&
> +	    !PageKsm(page) && page_count(page) == 1) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
> @@ -3681,10 +3702,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
>  	}
>  
> -	swap_free(entry);
> -	if (mem_cgroup_swap_full(page) ||
> -	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> -		try_to_free_swap(page);
>  	unlock_page(page);
>  	if (page != swapcache && swapcache) {
>  		/*
>
>
> And followed by more patches that rip out reuse_swap_page() completely.
>
> But I still have to understand what's safe to do and not to do in
> do_swap_page(), and what effect it would have. Especially PTE-mapped
> THP are a head-scratcher for me, as is the interaction with THP in the
> swapcache before/after actual swap.
>
> So consider the above a WIP prototype that's mostly untested.

-- 
Best Regards,
Liang Zhang