Re: [GIT PULL v2 09/20] KVM: s390: move pv gmap functions into kvm

On 12.02.25 19:14, David Hildenbrand wrote:
On 12.02.25 18:45, Claudio Imbrenda wrote:
On Wed, 12 Feb 2025 17:55:18 +0100
David Hildenbrand <david@xxxxxxxxxx> wrote:

On 31.01.25 12:24, Claudio Imbrenda wrote:
Move gmap related functions from kernel/uv into kvm.

Create a new file to collect gmap-related functions.

Reviewed-by: Janosch Frank <frankja@xxxxxxxxxxxxx>
Reviewed-by: Christoph Schlameuss <schlameuss@xxxxxxxxxxxxx>
[fixed unpack_one(), thanks mhartmay@xxxxxxxxxxxxx]
Link: https://lore.kernel.org/r/20250123144627.312456-6-imbrenda@xxxxxxxxxxxxx
Signed-off-by: Claudio Imbrenda <imbrenda@xxxxxxxxxxxxx>
Message-ID: <20250123144627.312456-6-imbrenda@xxxxxxxxxxxxx>
---

This patch breaks large folio splitting, because you end up un-refing
the wrong folios after a split; I tried to make it work, but, whether
because of other changes in this patch or in other ones, I cannot get
it to work and have to give up for today.

Yes, I had also noticed that and I already have a fix ready. In fact my
fix was exactly like yours, except that I no longer pass the struct
folio to kvm_s390_wiggle_split_folio(); instead I only pass a page and
use page_folio() at the beginning, and I use
split_huge_page_to_list_to_order() directly instead of split_folio().
Unfortunately, the fix does not fix the issue I'm seeing...
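
For reference, that variant would look roughly like the sketch below
(illustrative only, not the actual fix from the series; the exact
signature and locking of kvm_s390_wiggle_split_folio() are assumptions,
and the caller is assumed to already hold a reference on the page):

/*
 * Sketch only: take a page, resolve the folio once with page_folio(),
 * and split via split_huge_page_to_list_to_order() instead of
 * split_folio(). Signature and locking are assumptions, not the real
 * helper from the series.
 */
static int kvm_s390_wiggle_split_folio(struct page *page)
{
	struct folio *folio = page_folio(page);
	int rc;

	if (!folio_test_large(folio))
		return 0;

	folio_lock(folio);
	/* new_order == 0: split the large folio down to base pages */
	rc = split_huge_page_to_list_to_order(page, NULL, 0);
	folio_unlock(folio);

	return rc;
}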

But putting printks everywhere seems to make the issue go away, so it
looks like a race somewhere.

It also doesn't work with a single vCPU for me. The VM is stuck in

With two vCPUs (so one can report the lockup), I get:

[   62.645168] rcu: INFO: rcu_sched self-detected stall on CPU
[   62.645181] rcu:     0-....: (5999 ticks this GP) idle=0104/1/0x4000000000000002 softirq=2/2 fqs=2997
[   62.645186] rcu:     (t=6000 jiffies g=-1199 q=62 ncpus=2)
[   62.645191] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-427.33.1.el9_4.s390x #1
[   62.645194] Hardware name: IBM 3931 LA1 400 (KVM/Linux)
[   62.645195] Krnl PSW : 0704c00180000000 0000000024b3e776 (set_memory_decrypted+0x66/0xa0)
[   62.645206]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[   62.645208] Krnl GPRS: 00000000ca004000 0000037f00000001 000000008092f000 0000000000000000
[   62.645210]            0000037fffb1bbc0 0000000000000001 0000000025e75208 000000008092f000
[   62.645211]            0000000080873808 0000037fffb1bcd8 0000000000001000 0000000025e75220
[   62.645213]            0000000080281500 00000000258aa480 0000000024c0b17a 0000037fffb1bb20
[   62.645220] Krnl Code: 0000000024b3e76a: a784000f            brc     8,0000000024b3e788
[   62.645220]            0000000024b3e76e: a7210fff            tmll    %r2,4095
[   62.645220]           #0000000024b3e772: a7740017            brc     7,0000000024b3e7a0
[   62.645220]           >0000000024b3e776: b9a40034            uvc     %r3,%r4,0
[   62.645220]            0000000024b3e77a: b2220010            ipm     %r1
[   62.645220]            0000000024b3e77e: 8810001c            srl     %r1,28
[   62.645220]            0000000024b3e782: ec12fffa017e        cij     %r1,1,2,0000000024b3e776
[   62.645220]            0000000024b3e788: a72b1000            aghi    %r2,4096
[   62.645232] Call Trace:
[   62.645234]  [<0000000024b3e776>] set_memory_decrypted+0x66/0xa0
[   62.645238]  [<0000000024c0b17a>] dma_direct_alloc+0x16a/0x2d0
[   62.645242]  [<0000000024c09b92>] dma_alloc_attrs+0x62/0x80
[   62.645243]  [<000000002546c950>] cio_gp_dma_create+0x60/0xa0
[   62.645248]  [<0000000025ebb712>] css_bus_init+0x102/0x1b8
[   62.645252]  [<0000000025ebb7ea>] channel_subsystem_init+0x22/0xf8
[   62.645254]  [<0000000024b149ac>] do_one_initcall+0x3c/0x200
[   62.645256]  [<0000000025e777be>] do_initcalls+0x11e/0x148
[   62.645260]  [<0000000025e77a34>] kernel_init_freeable+0x1cc/0x208
[   62.645262]  [<00000000254ad01e>] kernel_init+0x2e/0x170
[   62.645264]  [<0000000024b16fdc>] __ret_from_fork+0x3c/0x60
[   62.645266]  [<00000000254bb07a>] ret_from_fork+0xa/0x40


I can only suspect that it is related to the following: if we split a non-anon
folio, we unmap it from the page tables and don't remap it again -- the next
fault will do that. Maybe, for some reason, that behavior is incompatible with
your changes.

I don't quite see how, because we should just trigger another fault to look up
the page in gmap_make_secure()->gfn_to_page() when we re-enter
gmap_make_secure() after a split.


The removed PTE lock would only explain it if we had a concurrent GUP etc.
from QEMU I/O? Not sure.

To fix the wrong refcount freezing, doing exactly what folio splitting does
(migration PTEs, locking the pagecache etc., freezing->converting,
removing migration ptes) should work, but requires a bit of work.

I played with the following abomination to see if I could fix the refcount freezing somehow.

It doesn't work, because the UVC keeps failing; I assume that is because it
actually needs the page to be mapped into that particular page table for the
UVC to complete.


To fix refcount freezing with that (folio still mapped), we'd have to make
sure that folio_mapcount() == 1 while we hold the PTL, and do something
similar to below, except that the rmap/anon locking and unmap/remap handling
would not apply. The pagecache would most likely have to be locked to prevent
new references from there while we freeze the refcount.

If we had folio_mapcount() != 1 on an anon page, we would have to give up;
but that is impossible if it is mapped writable -- so no problem.

If we had folio_mapcount() != 1 on a pagecache page, we would have to force
an unmap of all the page table mappings using, e.g., try_to_unmap(), and then
retry.

But the PTL seems unavoidable in that case to prevent concurrent GUP-slow
etc., so that we can safely freeze the refcount.
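
For illustration, that freeze-while-still-mapped variant could look roughly
like the sketch below (rough sketch only; the helper name is invented, and
the PTL and pagecache locking are merely assumed in the comments rather than
implemented):

/*
 * Rough sketch, untested: freeze the refcount with the folio still mapped.
 * Assumes the caller holds the PTL of the single mapping, so that
 * folio_mapcount() cannot change under us, and that the pagecache (for
 * file folios) is locked against new references.
 */
static int make_mapped_folio_secure(struct folio *folio, struct uv_cb_header *uvcb)
{
	int expected, cc;

	/* mapcount != 1: give up (anon) or try_to_unmap() and retry (file) */
	if (folio_mapcount(folio) != 1)
		return -EAGAIN;

	/* same expected-refcount computation as the existing code */
	expected = expected_folio_refs(folio) + 1;
	if (!folio_ref_freeze(folio, expected))
		return -EBUSY;

	set_bit(PG_arch_1, &folio->flags);
	cc = __uv_call(0, (u64)uvcb);
	folio_ref_unfreeze(folio, expected);

	if (cc == UVC_CC_OK)
		return 0;
	if (cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL)
		return -EAGAIN;
	return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
}

The try_to_unmap()-and-retry path for pagecache folios with
folio_mapcount() != 1, and the actual pagecache locking, would still have to
be wired up around that.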


From c2555fc34801ca9ba49f93ee1249ecd25248377a Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@xxxxxxxxxx>
Date: Thu, 13 Feb 2025 09:49:54 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
 arch/s390/kernel/uv.c | 139 +++++++++++++++++++++++++++++++++++-------
 include/linux/rmap.h  |  17 ++++++
 mm/internal.h         |  16 -----
 3 files changed, 133 insertions(+), 39 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 9f05df2da2f73..d6ea8951fa53b 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -15,6 +15,7 @@
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/pagewalk.h>
+#include <linux/rmap.h>
 #include <asm/facility.h>
 #include <asm/sections.h>
 #include <asm/uv.h>
@@ -227,6 +228,45 @@ static int expected_folio_refs(struct folio *folio)
 	return res;
 }
+static void unmap_folio(struct folio *folio)
+{
+	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SYNC |
+		TTU_BATCH_FLUSH;
+
+	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
+
+	if (folio_test_pmd_mappable(folio))
+		ttu_flags |= TTU_SPLIT_HUGE_PMD;
+
+	/*
+	 * Anon pages need migration entries to preserve them, but file
+	 * pages can simply be left unmapped, then faulted back on demand.
+	 * If that is ever changed (perhaps for mlock), update remap_page().
+	 */
+	if (folio_test_anon(folio))
+		try_to_migrate(folio, ttu_flags);
+	else
+		try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK);
+
+	try_to_unmap_flush();
+}
+
+static void remap_page(struct folio *folio, unsigned long nr, int flags)
+{
+	int i = 0;
+
+	/* If unmap_folio() uses try_to_migrate() on file, remove this check */
+	if (!folio_test_anon(folio))
+		return;
+	for (;;) {
+		remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
+		i += folio_nr_pages(folio);
+		if (i >= nr)
+			break;
+		folio = folio_next(folio);
+	}
+}
+
 /**
  * make_folio_secure() - make a folio secure
  * @folio: the folio to make secure
@@ -247,35 +287,88 @@ static int expected_folio_refs(struct folio *folio)
  */
 int make_folio_secure(struct folio *folio, struct uv_cb_header *uvcb)
 {
-	int expected, cc = 0;
+	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
+	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = NULL;
+	int ret, cc = 0;
+	int expected;
 	if (folio_test_large(folio))
 		return -E2BIG;
 	if (folio_test_writeback(folio))
 		return -EBUSY;
-	expected = expected_folio_refs(folio) + 1;
-	if (!folio_ref_freeze(folio, expected))
+
+	/* Does it make sense to try at all? */
+	if (folio_ref_count(folio) != expected_folio_refs(folio) + 1)
 		return -EBUSY;
-	set_bit(PG_arch_1, &folio->flags);
-	/*
-	 * If the UVC does not succeed or fail immediately, we don't want to
-	 * loop for long, or we might get stall notifications.
-	 * On the other hand, this is a complex scenario and we are holding a lot of
-	 * locks, so we can't easily sleep and reschedule. We try only once,
-	 * and if the UVC returned busy or partial completion, we return
-	 * -EAGAIN and we let the callers deal with it.
-	 */
-	cc = __uv_call(0, (u64)uvcb);
-	folio_ref_unfreeze(folio, expected);
-	/*
-	 * Return -ENXIO if the folio was not mapped, -EINVAL for other errors.
-	 * If busy or partially completed, return -EAGAIN.
-	 */
-	if (cc == UVC_CC_OK)
-		return 0;
-	else if (cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL)
-		return -EAGAIN;
-	return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
+
+	/* See split_huge_page_to_list_to_order() on the nasty details. */
+	if (folio_test_anon(folio)) {
+		anon_vma = folio_get_anon_vma(folio);
+		if (!anon_vma)
+			return -EBUSY;
+		anon_vma_lock_write(anon_vma);
+	} else {
+		mapping = folio->mapping;
+		if (!mapping)
+			return -EBUSY;
+		/* Hmmm, do we need filemap_release_folio()? */
+		i_mmap_lock_read(mapping);
+	}
+
+	unmap_folio(folio);
+
+	local_irq_disable();
+	if (mapping) {
+		xas_lock(&xas);
+		xas_reset(&xas);
+		if (xas_load(&xas) != folio) {
+			ret = -EBUSY;
+			goto fail;
+		}
+	}
+
+	expected = expected_folio_refs(folio) + 1;
+	if (!folio_mapped(folio) &&
+	    folio_ref_freeze(folio, expected)) {
+		set_bit(PG_arch_1, &folio->flags);
+		/*
+		 * If the UVC does not succeed or fail immediately, we don't want to
+		 * loop for long, or we might get stall notifications.
+		 * On the other hand, this is a complex scenario and we are holding a lot of
+		 * locks, so we can't easily sleep and reschedule. We try only once,
+		 * and if the UVC returned busy or partial completion, we return
+		 * -EAGAIN and we let the callers deal with it.
+		 */
+		cc = __uv_call(0, (u64)uvcb);
+		folio_ref_unfreeze(folio, expected);
+		/*
+		 * Return -ENXIO if the folio was not mapped, -EINVAL for other errors.
+		 * If busy or partially completed, return -EAGAIN.
+		 */
+		if (cc == UVC_CC_OK)
+			ret = 0;
+		else if (cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL)
+			ret = -EAGAIN;
+		else
+			ret = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
+	} else {
+		ret = -EBUSY;
+	}
+
+	if (mapping)
+		xas_unlock(&xas);
+fail:
+	local_irq_enable();
+	remap_page(folio, 1, 0);
+	if (anon_vma) {
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
+	if (mapping)
+		i_mmap_unlock_read(mapping);
+	xas_destroy(&xas);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(make_folio_secure);
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 683a04088f3f2..2d241ab48bf08 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -663,6 +663,23 @@ int folio_referenced(struct folio *, int is_locked,
 void try_to_migrate(struct folio *folio, enum ttu_flags flags);
 void try_to_unmap(struct folio *, enum ttu_flags flags);
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+void try_to_unmap_flush(void);
+void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+static inline void try_to_unmap_flush_dirty(void)
+{
+}
+static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
+
 int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, struct page **pages,
 				void *arg);
diff --git a/mm/internal.h b/mm/internal.h
index 109ef30fee11f..5338906163ca7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1202,22 +1202,6 @@ struct tlbflush_unmap_batch;
  */
 extern struct workqueue_struct *mm_percpu_wq;
-#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-void try_to_unmap_flush(void);
-void try_to_unmap_flush_dirty(void);
-void flush_tlb_batched_pending(struct mm_struct *mm);
-#else
-static inline void try_to_unmap_flush(void)
-{
-}
-static inline void try_to_unmap_flush_dirty(void)
-{
-}
-static inline void flush_tlb_batched_pending(struct mm_struct *mm)
-{
-}
-#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
-
 extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
--
2.48.1



--
Cheers,

David / dhildenb




