On 20.01.22 16:37, Linus Torvalds wrote: > On Thu, Jan 20, 2022 at 5:26 PM David Hildenbrand <david@xxxxxxxxxx> wrote: >> >> So your claim is that read-only, PTE mapped pages are weird? How do you >> come to that conclusion? >> >> If we adjust the THP reuse logic to split on additional references >> (page_count() == 1) -- similarly as suggested by Linus to fix the CVE -- >> we're going to end up with exactly that more frequently. > > If you write to a THP page that has page_count() elevated - presumably > because of a fork() - then that COW is exactly what you want to > happen. > > And presumably you want it to happen page-by-page. > > So I fail to see what the problem is. > > The *normal* THP case is that there hasn't been a fork, and there is > no COW activity. That's the only thing worth trying to optimize for > and worry about. > > If you do some kind of fork with huge-pages, and actually write to > those pages (as opposed to just execve() in the child and wait in the > parent), you only have yourself to blame. You *will* take COW faults, > and you have to do it, and at that point spliting the THP in whoever > did the COW is likely the right thing to do just to hope that you > don't have to allocate a whole new hugepage. So it will cause new > (small-page) allocations and copies. > > And yes, at that point, writes to the THP page will cause COW's for > both sides as they both end up making that "split it" decision. > > Honestly, would anything else ever even make sense? > > If you care about THP performance, you make sure that the COW THP case > simply never happens. I'm, not concerned about fork(), I'm concerned about other false positives. Here is what I currently have, I hope that makes sense: commit f8fe5e15e04f7563606d58daee795aa0cc0e454c (HEAD) Author: David Hildenbrand <david@xxxxxxxxxx> Date: Fri Jan 14 09:53:35 2022 +0100 mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() We currently have a different COW logic for THP than we have for ordinary pages in do_wp_page(): the effect is that the issue reported in CVE-2020-29374 is currently still possible for THP: a child process can observe memory modifications of the parent process to MAP_PRIVATE pages. Let's apply the same logic (page_count() == 1), conditionally trying to remove the page from the swapcache after freeing the swap entry, however, before actually mapping our page. Note that KSM does not apply to THP and that we won't be waiting on writeback. In the future, we might want to wait for writeback in case the page is not swapped by any other process. For now, let's keep it simple. If we find a page_count() > 1, we'll have to split and fallback to do_wp_page(), which will copy the page. This change has the following effects: (1) Mapping a THP with GUP references read-only and triggering a write fault will result in a split and disconnect of GUP from the actual page table just as we know it for ordinary pages already: this will require a fix for all anonymous pages, for example, by marking anon pages exclusive to a single process and allowing GUP pins only on exclusive anon pages. (2) Additional references on the compound page, or concurrent writeback, will now result in the THP getting mapped via PTEs, whereby each PTE requires a reference on the compound page. Once we have a PTE-mapped PMD, do_wp_page() will always copy and never reuse. Signed-off-by: David Hildenbrand <david@xxxxxxxxxx> diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 406a3c28c026..a4f2754142aa 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1303,7 +1303,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) page = pmd_page(orig_pmd); VM_BUG_ON_PAGE(!PageHead(page), page); - /* Lock page for reuse_swap_page() */ if (!trylock_page(page)) { get_page(page); spin_unlock(vmf->ptl); @@ -1319,10 +1318,14 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) } /* - * We can only reuse the page if nobody else maps the huge page or it's - * part. + * See do_wp_page(): we can reuse only if there are no additional + * references (page_count() is 1). Try to free the swapcache to get + * rid of the swapcache references in case it's likely that we will + * succeed. */ - if (reuse_swap_page(page)) { + if (PageSwapCache(page) && page_count(page) == 1 + thp_nr_pages(page)) + try_to_free_swap(page); + if (page_count(page) == 1) { pmd_t entry; entry = pmd_mkyoung(orig_pmd); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); -- Thanks, David / dhildenb