In the read-only mprotect() phase of mmu_stress_test, ensure that
mprotect(PROT_READ) has completed before the guest starts writing to the
now read-only memory.

Without waiting for mprotect_ro_done before the guest starts writing in
stage 3 (the stage for read-only mprotect()), the host's assertion of
stage 3 can fail if mprotect_ro_done is set to true in the window between
the guest finishing writes to all GPAs and executing GUEST_SYNC(3).
This scenario is especially likely to occur when there are hundreds of
vCPUs.

  CPU 0                   CPU 1 guest                 CPU 1 host
                                                      enter stage 3's 1st loop
                          //in stage 3
                          write all GPAs
                          @rip 0x4025f0

  mprotect(PROT_READ)
  mprotect_ro_done=true
                          GUEST_SYNC(3)
                                                      r=0, continue stage 3's 1st loop
                          //in stage 4
                          write GPA
                          @rip 0x402635
                                                      -EFAULT, jump out stage 3's 1st loop
                                                      enter stage 3's 2nd loop
                          write GPA
                          @rip 0x402635
                                                      -EFAULT, continue stage 3's 2nd loop
                                                      guest rip += 3

The test then fails and reports "Unhandled exception '0xe' at guest RIP
'0x402638'", since the next valid guest rip address is 0x402639, i.e. the
"(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
of length 4.  Even if it were compiled into a mov instruction of length 3,
the subsequent execution of GUEST_SYNC(4) in the guest could still cause
the host's assertion of stage 3 to fail.

Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 25 +++++++++++--------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index d9c76b4c0d88..a2ccf705bb2a 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -35,11 +35,16 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 	GUEST_SYNC(2);
 
 	/*
-	 * Write to the region while mprotect(PROT_READ) is underway. Keep
-	 * looping until the memory is guaranteed to be read-only, otherwise
-	 * vCPUs may complete their writes and advance to the next stage
-	 * prematurely.
-	 *
+	 * mprotect(PROT_READ) is underway. Keep looping until the memory is
+	 * guaranteed to be read-only, otherwise vCPUs may complete their
+	 * writes and advance to the next stage prematurely.
+	 */
+	do {
+		;
+	} while (!READ_ONCE(mprotect_ro_done));
+
+	/*
+	 * Write to the region after mprotect(PROT_READ) is done.
 	 * For architectures that support skipping the faulting instruction,
 	 * generate the store via inline assembly to ensure the exact length
 	 * of the instruction is known and stable (vcpu_arch_put_guest() on
@@ -47,16 +52,14 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 	 * is low in this case). For x86, hand-code the exact opcode so that
 	 * there is no room for variability in the generated instruction.
 	 */
-	do {
-		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
+		asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
-			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
+		asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
-			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
+		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-	} while (!READ_ONCE(mprotect_ro_done));
 
 	/*
 	 * Only architectures that write the entire range can explicitly sync,
-- 
2.43.2
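
The ordering requirement can also be seen outside of KVM with nothing more
than mprotect() and a flag. Below is a minimal, self-contained userspace
sketch, not the selftest itself: the names ro_done, region, writer() and
segv_handler(), and the "gcc -pthread" build line, are illustrative
assumptions. The writer spins until a flag that is set only after
mprotect(PROT_READ) has returned, so its store is guaranteed to fault,
which mirrors the guarantee this patch restores for the guest's stage-3
writes.

  /* build (assumption): gcc -O2 -pthread ro_sync_demo.c -o ro_sync_demo */
  #include <pthread.h>
  #include <signal.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static atomic_bool ro_done;		/* analogous to mprotect_ro_done */
  static unsigned char *region;
  static long page_size;

  static void segv_handler(int sig)
  {
  	/* Once the region is read-only, every write lands here. */
  	_exit(0);
  }

  static void *writer(void *arg)
  {
  	/*
  	 * Spin until the read-only transition has completed.  Writing
  	 * before this point could succeed and break the "every write
  	 * faults" assumption -- the race the patch closes.
  	 */
  	while (!atomic_load(&ro_done))
  		;

  	region[0] = 1;	/* must fault; the handler exits the process */

  	fprintf(stderr, "write unexpectedly succeeded\n");
  	abort();
  	return NULL;
  }

  int main(void)
  {
  	pthread_t thread;

  	page_size = sysconf(_SC_PAGESIZE);
  	region = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
  		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	if (region == MAP_FAILED)
  		return 1;

  	signal(SIGSEGV, segv_handler);

  	pthread_create(&thread, NULL, writer, NULL);

  	/* Set the flag only after mprotect(PROT_READ) has returned. */
  	mprotect(region, page_size, PROT_READ);
  	atomic_store(&ro_done, true);

  	pthread_join(thread, NULL);
  	return 0;
  }

If the store were issued before the flag flips, it could succeed against
the still-writable mapping, which is exactly the window shown in the
timeline above.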