Re: [PATCH 07/19] mm/fork: Accept huge pfnmap entries

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 09.08.24 19:15, Peter Xu wrote:
On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
On 09.08.24 18:08, Peter Xu wrote:
Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
much easier, the write bit needs to be persisted though for writable and
shared pud mappings like PFNMAP ones, otherwise a follow up write in either
parent or child process will trigger a write fault.

Do the same for pmd level.

Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
   mm/huge_memory.c | 27 ++++++++++++++++++++++++---
   1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6568586b21ab..015c9468eed5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
   	pgtable_t pgtable = NULL;
   	int ret = -ENOMEM;
+	pmd = pmdp_get_lockless(src_pmd);
+	if (unlikely(pmd_special(pmd))) {
+		dst_ptl = pmd_lock(dst_mm, dst_pmd);
+		src_ptl = pmd_lockptr(src_mm, src_pmd);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+		/*
+		 * No need to recheck the pmd, it can't change with write
+		 * mmap lock held here.
+		 */
+		if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
+			pmdp_set_wrprotect(src_mm, addr, src_pmd);
+			pmd = pmd_wrprotect(pmd);
+		}
+		goto set_pmd;
+	}
+

I strongly assume we should be using using vm_normal_page_pmd() instead of
pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
and vm_normal_page_pmd().

One thing to mention that it has this:

	if (!vma_is_anonymous(dst_vma))
		return 0;

Another obscure thing in this function. It's not the job of copy_huge_pmd() to make the decision whether to copy, it's the job of vma_needs_copy() in copy_page_range().

And now I have to suspect that uffd-wp is broken with this function, because as vma_needs_copy() clearly states, we must copy, and we don't do that for PMDs. Ugh.

What a mess, we should just do what we do for PTEs and we will be fine ;)

Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"

Would that even be the case with PFNMAP? I suspect that pmd_trans_huge() would return "true" for special pfnmap, which is rather "surprising", but fortunate for us.

Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup for another day.


So it's only about anonymous below that.  In that case I feel like the
pmd_page() is benign, and actually good.

Yes, it would likely currently work.


Though what you're saying here made me notice my above check doesn't seem
to be necessary, I mean, "(is_cow_mapping(src_vma->vm_flags) &&
pmd_write(pmd))" can't be true when special bit is set, aka, pfnmaps.. and
if it's writable for CoW it means it's already an anon.

I think I can probably drop that line there, perhaps with a
VM_WARN_ON_ONCE() making sure it won't happen.


Again, we should be doing this similar to how we handle PTEs.

I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd)" check, below:
what else should we have here if it's not a migration entry but a present
entry?

I had a feeling that it was just a safety belt since the 1st day of thp
when Andrea worked that out, so that it'll work with e.g. file truncation
races.

But with current code it looks like it's only anonymous indeed, so looks
not possible at least from that pov.

Yes, as stated above, likely broken with UFFD-WP ...

I really think we should make this code just behave like it would with PTEs, instead of throwing in more "different" handling.

--
Cheers,

David / dhildenb





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux