On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> On Sat, Feb 08, 2025, Yan Zhao wrote:
> > In the read-only mprotect() phase of mmu_stress_test, ensure that
> > mprotect(PROT_READ) has completed before the guest starts writing to the
> > read-only mprotect() memory.
> >
> > Without waiting for mprotect_ro_done before the guest starts writing in
> > stage 3 (the stage for read-only mprotect()), the host's assertion of stage
> > 3 could fail if mprotect_ro_done is set to true in the window between the
> > guest finishing writes to all GPAs and executing GUEST_SYNC(3).
> >
> > This scenario occurs easily, especially when there are hundreds of vCPUs.
> >
> >  CPU 0                    CPU 1 guest                 CPU 1 host
> >                           enter stage 3's 1st loop
> >                           //in stage 3
> >                           write all GPAs
> >                           @rip 0x4025f0
> >
> >  mprotect(PROT_READ)
> >  mprotect_ro_done=true
> >                           GUEST_SYNC(3)
> >                                                       r=0, continue stage 3's 1st loop
> >
> >                           //in stage 4
> >                           write GPA
> >                           @rip 0x402635
> >
> >                                                       -EFAULT, jump out stage 3's 1st loop
> >                                                       enter stage 3's 2nd loop
> >                           write GPA
> >                           @rip 0x402635
> >                                                       -EFAULT, continue stage 3's 2nd loop
> >                                                       guest rip += 3
> >
> > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > of length 4.
>
> This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
> defined?
Stage 3 is the hand-coded "mov %rax, (%rax)", but stage 4 uses
vcpu_arch_put_guest().

The original code expects the "mov %rax, (%rax)" writes in stage 3 to produce
-EFAULT, so that the host thread can jump out of stage 3's 1st vcpu_run() loop:

        /*
         * Stage 3, write guest memory and verify KVM returns -EFAULT for once
         * the mprotect(PROT_READ) lands.  Only architectures that support
         * validating *all* of guest memory sync for this stage, as vCPUs will
         * be stuck on the faulting instruction for other architectures. Go to
         * stage 3 without a rendezvous
         */
        do {
                r = _vcpu_run(vcpu);
        } while (!r);
        TEST_ASSERT(r == -1 && errno == EFAULT,
                    "Expected EFAULT on write to RO memory, got r = %d, errno = %d",
                    r, errno);

Then, in host stage 3's 2nd vcpu_run() loop, rip += 3 is performed to skip the
"mov %rax, (%rax)" in the guest:

        for (;;) {
                r = _vcpu_run(vcpu);
                if (!r)
                        break;
                TEST_ASSERT_EQ(errno, EFAULT);
#if defined(__x86_64__)
                WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
                vcpu->run->s.regs.regs.rip += 3;
#elif defined(__aarch64__)
                vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),
                             vcpu_get_reg(vcpu, ARM64_CORE_REG(regs.pc)) + 4);
#endif
        }

Finally, the guest calls GUEST_SYNC(3) to let the host jump out of the 2nd
vcpu_run() loop and assert_sync_stage(vcpu, 3).

However, there is a chance that none of the "mov %rax, (%rax)" writes in
stage 3 causes an -EFAULT. The guest then leaves stage 3 after finding
mprotect_ro_done=true, and GUEST_SYNC(3) causes r=0, so the host stays in
stage 3's first vcpu_run() loop.

mprotect(PROT_READ) then takes effect only after the guest has entered stage 4.
The vcpu_arch_put_guest() in guest stage 4 produces -EFAULT and the host jumps
out of stage 3's first vcpu_run() loop. But the rip += 3 in host stage 3's
second vcpu_run() loop does not match the instruction emitted for
vcpu_arch_put_guest(), producing the "Unhandled exception '0xe'" error.
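To spell out the rip arithmetic behind that report (all numbers are from the
failure above; only the instruction length matters, not the exact bytes):

        0x402635                   stage-4 faulting write, the 4-byte mov compiled
                                   from "(mem) = val" in vcpu_arch_put_guest()
        0x402635 + 4 = 0x402639    the next valid guest rip
        0x402635 + 3 = 0x402638    where the host's rip += 3 (sized for the 3-byte
                                   stage-3 "mov %rax, (%rax)") resumes the guest,
                                   i.e. in the middle of the instruction, hence
                                   "Unhandled exception '0xe' at guest RIP '0x402638'"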
> 
> 	do {
> 		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> #ifdef __x86_64__
> 			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
> #elif defined(__aarch64__)
> 			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
> #else
> 			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> #endif
> 	} while (!READ_ONCE(mprotect_ro_done));
> 
> 	/*
> 	 * Only architectures that write the entire range can explicitly sync,
> 	 * as other architectures will be stuck on the write fault.
> 	 */
> #if defined(__x86_64__) || defined(__aarch64__)
> 	GUEST_SYNC(3);
> #endif
> 
> 	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> 		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> 	GUEST_SYNC(4);
> 
> > Even if it could be compiled into a mov instruction of length 3, the
> > following execution of GUEST_SYNC(4) in the guest could still cause the
> > host's assertion of stage 3 to fail.
> 
> Sorry, I don't follow.  The guest doesn't get "released" from GUEST_SYNC(3) until
> the host runs the vCPU again, and that happens after asserting on stage 3 and
> synchronizing with the main thread.
The sequence is:

//guest stage 3
do {
	for (...)
		mov %rax, (%rax)                ==> 3.1 host mprotect(PROT_READ) does not yet
		                                        take effect, mprotect_ro_done=false
} while (!READ_ONCE(mprotect_ro_done));         ==> 3.2 host mprotect(PROT_READ) takes effect
                                                        here, mprotect_ro_done=true

GUEST_SYNC(3);                                  ==> host stage 3's vcpu_run() returns 0, so the
                                                    host is still in stage 3's first vcpu_run()
                                                    loop

//guest stage 4. With host mprotect(PROT_READ) in effect, vcpu_arch_put_guest()
//causes -EFAULT and the host enters stage 3's second vcpu_run() loop.
for (...)
	vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
                                                ==> rip += 3 is wrong for this instruction

GUEST_SYNC(4);                                  ==> since the host is still in stage 3, even if
                                                    we changed vcpu_arch_put_guest() in guest
                                                    stage 4 to "mov %rax, (%rax)", this could not
                                                    pass the host's assert_sync_stage(vcpu, 3)

> 
> 	assert_sync_stage(vcpu, 3);
> #endif /* __x86_64__ || __aarch64__ */
> 	rendezvous_with_boss();
> 
> 	/*
> 	 * Stage 4.  Run to completion, waiting for mprotect(PROT_WRITE) to
> 	 * make the memory writable again.
> 	 */
> 	do {
> 		r = _vcpu_run(vcpu);
> 	} while (r && errno == EFAULT);
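That is why the changelog has the guest wait for mprotect_ro_done before it
starts the stage-3 writes: every write is then guaranteed to hit RO memory and
exit with -EFAULT, and rip += 3 only ever has to skip the hand-coded 3-byte
mov. Roughly along the lines of the sketch below (illustrating the intended
ordering only, not the exact diff):

	/*
	 * Sketch: spin until the main thread has completed mprotect(PROT_READ)
	 * before writing guest memory, so that each stage-3 write below hits
	 * read-only memory and exits to the host with -EFAULT.
	 */
	while (!READ_ONCE(mprotect_ro_done))
		;

	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
#ifdef __x86_64__
		asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
#elif defined(__aarch64__)
		asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
#else
		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
#endif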