In the read-only mprotect() phase of mmu_stress_test, ensure that
mprotect(PROT_READ) has completed before the guest starts writing to the
now read-only memory.

Without waiting for mprotect_ro_done before the guest starts writing in
stage 3 (the stage for read-only mprotect()), the host's assertion of
stage 3 can fail if mprotect_ro_done is set to true in the window between
the guest finishing writes to all GPAs and executing GUEST_SYNC(3).
This scenario is especially likely to occur when there are hundreds of
vCPUs.

  CPU 0                   CPU 1 guest                 CPU 1 host
                                                      enter stage 3's 1st loop
                          //in stage 3
                          write all GPAs
                          @rip 0x4025f0

  mprotect(PROT_READ)
  mprotect_ro_done=true
                          GUEST_SYNC(3)
                                                      r=0, continue stage 3's 1st loop
                          //in stage 4
                          write GPA
                          @rip 0x402635
                                                      -EFAULT, jump out stage 3's 1st loop
                                                      enter stage 3's 2nd loop
                          write GPA
                          @rip 0x402635
                                                      -EFAULT, continue stage 3's 2nd loop
                                                      guest rip += 3

The test then fails and reports "Unhandled exception '0xe' at guest RIP
'0x402638'", since the next valid guest rip address is 0x402639, i.e. the
"(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
of length 4.  Even if it were compiled into a mov instruction of length 3,
the subsequent execution of GUEST_SYNC(4) in the guest could still cause
the host's assertion of stage 3 to fail.

Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 25 +++++++++++--------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index d9c76b4c0d88..a2ccf705bb2a 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -35,11 +35,16 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 	GUEST_SYNC(2);
 
 	/*
-	 * Write to the region while mprotect(PROT_READ) is underway. Keep
-	 * looping until the memory is guaranteed to be read-only, otherwise
-	 * vCPUs may complete their writes and advance to the next stage
-	 * prematurely.
-	 *
+	 * mprotect(PROT_READ) is underway. Keep looping until the memory is
+	 * guaranteed to be read-only, otherwise vCPUs may complete their
+	 * writes and advance to the next stage prematurely.
+	 */
+	do {
+		;
+	} while (!READ_ONCE(mprotect_ro_done));
+
+	/*
+	 * Write to the region after mprotect(PROT_READ) is done.
 	 * For architectures that support skipping the faulting instruction,
 	 * generate the store via inline assembly to ensure the exact length
 	 * of the instruction is known and stable (vcpu_arch_put_guest() on
@@ -47,16 +52,14 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 	 * is low in this case). For x86, hand-code the exact opcode so that
 	 * there is no room for variability in the generated instruction.
 	 */
-	do {
-		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
+		asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
-			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
+		asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
-			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
+		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-	} while (!READ_ONCE(mprotect_ro_done));
 
 	/*
 	 * Only architectures that write the entire range can explicitly sync,
-- 
2.43.2
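
The ordering requirement can also be seen outside of KVM with nothing more
than mprotect() and a flag. Below is a minimal, self-contained userspace
sketch, not the selftest itself: the names ro_done, region, writer() and
segv_handler(), and the "gcc -pthread" build line, are illustrative
assumptions. The writer spins until a flag that is set only after
mprotect(PROT_READ) has returned, so its store is guaranteed to fault,
which mirrors the guarantee this patch restores for the guest's stage-3
writes.

  /* build (assumption): gcc -O2 -pthread ro_sync_demo.c -o ro_sync_demo */
  #include <pthread.h>
  #include <signal.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static atomic_bool ro_done;		/* analogous to mprotect_ro_done */
  static unsigned char *region;
  static long page_size;

  static void segv_handler(int sig)
  {
  	/* Once the region is read-only, every write lands here. */
  	_exit(0);
  }

  static void *writer(void *arg)
  {
  	/*
  	 * Spin until the read-only transition has completed.  Writing
  	 * before this point could succeed and break the "every write
  	 * faults" assumption -- the race the patch closes.
  	 */
  	while (!atomic_load(&ro_done))
  		;

  	region[0] = 1;	/* must fault; the handler exits the process */

  	fprintf(stderr, "write unexpectedly succeeded\n");
  	abort();
  	return NULL;
  }

  int main(void)
  {
  	pthread_t thread;

  	page_size = sysconf(_SC_PAGESIZE);
  	region = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
  		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	if (region == MAP_FAILED)
  		return 1;

  	signal(SIGSEGV, segv_handler);

  	pthread_create(&thread, NULL, writer, NULL);

  	/* Set the flag only after mprotect(PROT_READ) has returned. */
  	mprotect(region, page_size, PROT_READ);
  	atomic_store(&ro_done, true);

  	pthread_join(thread, NULL);
  	return 0;
  }

If the store were issued before the flag flips, it could succeed against
the still-writable mapping, which is exactly the window shown in the
timeline above.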