On Fri, Sep 6, 2024 at 5:53 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, Sep 06, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > Add two phases to mmu_stress_test to verify that KVM correctly handles
> > > guest memory that was writable, and then made read-only in the primary MMU,
> > > and then made writable again.
> > >
> > > Add bonus coverage for x86 to verify that all of guest memory was marked
> > > read-only. Making forward progress (without making memory writable)
> > > requires arch specific code to skip over the faulting instruction, but the
> > > test can at least verify each vCPU's starting page was made read-only.
> > >
> > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> >
> > Writing off-list because I just have some smaller questions that I
> > don't want to bother the list with.
>
> Pulling everyone and the lists back in :-)
>
> IMO, no question is too small for kvm@, and lkml@ is a gigantic firehose that's 99%
> archival and 1% list, at best. Odds are very, very good that if you have a
> question, however trivial or small, then someone else has the exact same question,
> or _will_ have the question in the future.
>
> I strongly prefer that all questions, review, feedback, etc. happen on list, even
> if the questions/feedback may seem trivial or noisy. The only exception is if
> information can't/shouldn't be made public, e.g. because of an embargo, NDA,
> security implications, etc.

I'll keep this in mind, thanks!

> > For the next version, feel free to add:
> >
> > Reviewed-by: James Houghton <jthoughton@xxxxxxxxxx>
> >
> > All of the selftest patches look fine to me.
> >
> > > ---
> > >  tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
> > >  1 file changed, 84 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > index 50c3a17418c4..98f9a4660269 100644
> > > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > @@ -16,6 +16,8 @@
> > >  #include "guest_modes.h"
> > >  #include "processor.h"
> > >
> > > +static bool mprotect_ro_done;
> > > +
> > >  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> > >  {
> > >  	uint64_t gpa;
> > > @@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> > >  		*((volatile uint64_t *)gpa);
> > >  	GUEST_SYNC(2);
> > >
> > > +	/*
> > > +	 * Write to the region while mprotect(PROT_READ) is underway. Keep
> > > +	 * looping until the memory is guaranteed to be read-only, otherwise
> > > +	 * vCPUs may complete their writes and advance to the next stage
> > > +	 * prematurely.
> > > +	 */
> > > +	do {
> > > +		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> > > +#ifdef __x86_64__
> > > +			asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
> >
> > Ok so this appears to be a `mov BYTE PTR [rax + 0x0], 0x0`, where %rax = gpa. :)
> >
> > Does '0xc6,0x0,0x0' also work? It seems like that translates to `mov
> > BYTE PTR [rax], 0x0`. (just curious, no need to change it)
>
> LOL, yes, but as evidenced by the trailing comment, my intent was to generate
> "mov rax, [rax]", not "movb $0, [rax]". I suspect I was too lazy to consult the
> SDM to recall the correct opcode and simply copied an instruction from some random
> disassembly output without looking too closely at the output.
> > asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */ > > > And I take it you wrote it out like this (instead of using mnemonics) > > so that you could guarantee that IP + 4 would be the right way to skip > > forwards. Does it make sense to leave a comment about that? > > Yes and yes. > > > The translation from mnemonic --> bytes won't change... > > Heh, this is x86, by no means is that guaranteed. E.g. see the above, where the > same mnemonic can be represented multiple ways. > > > so could you just write the proper assembly? (not a request, just curious) > > In practice, probably. But the rules for inline assembly are, at best, fuzzy. > So long as the correct instruction is generated, the assembler has quite a bit > of freedom. > > E.g. similar to above, "mov %rax,(%rax)" can (should) be encoded as: > > 48 89 00 > > but can also be encoded as > > 48 89 40 00 > > Now, it's _extremely_ unlikely a compiler will actually generate the latter, but > it's perfectly legal to do so. E.g. with gcc-13, this > > mov %rax, 0x0(%rax) > > generates > > 48 89 00 > > even though a more literal interpretation would be > > 48 89 40 00 Oh... neat! I'm glad I asked about this. > > So yeah, while the hand-coded opcode is gross and annoying, given that a failure > due to the "wrong" instruction being generated would be painful and time consuming > to debug, hand-coding is worth avoiding the risk and potential pain if the compiler > decides to be mean :-) I 100% agree with hand-writing the opcode given what you've said. :) > > A comment that 0x40 corresponds to %rax and that "a" also corresponds > > to %rax would have been helpful for me. :) > > Eh, I get what you're saying, but giving a play-by-play of the encoding isn't > really all that reasonable because _so_ much information needs to be conveyed to > capture the entire picture, and some things are essentially table stakes when it > comes to x86 kernel programming. > > > E.g. 0x40 doesn't simply mean "(%rax)", it's a full ModR/M that defines the > addressing mode, which in turn depends on the operating mode (64-bit). > > And "a" isn't just %rax; it's specifically an input register constraint, e.g. is > distinctly different than: > > asm volatile(".byte 0x48,0x89,0x0" : "+a"(gpa) :: "memory"); /* mov %rax, (%rax) */ > > even though in this specific scenario they generate the same code. > > And with the correct "48 89 00", understanding the full encoding requires describing > REX prefixes, which are a mess unto themselves. > > So, a trailing comment (with the correct mnemonic) is all I'm willing to do, even > though I 100% agree that it puts a decent sized load on the reader. There's just > _too_ much information to communicate to the reader, at least for x86. The trailing comment works for me! Thanks for all the detail -- I am learning so much. > > > > +#else > > > + vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa); > > > +#endif > > > + } while (!READ_ONCE(mprotect_ro_done)); > > > + > > > + /* > > > + * Only x86 can explicitly sync, as other architectures will be stuck > > > + * on the write fault. > > > > It would also have been a little clearer if the comment also said how > > this is just because the test has been written to increment for the PC > > upon getting these write faults *for x86 only*. IDK something like > > "For x86, the test will adjust the PC for each write fault, allowing > > the above loop to complete. Other architectures will get stuck, so the > > #3 sync step is skipped." > > Ya. 
>
> > > +#else
> > > +			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> > > +#endif
> > > +	} while (!READ_ONCE(mprotect_ro_done));
> > > +
> > > +	/*
> > > +	 * Only x86 can explicitly sync, as other architectures will be stuck
> > > +	 * on the write fault.
> >
> > It would also have been a little clearer if the comment also said how
> > this is just because the test has been written to increment for the PC
> > upon getting these write faults *for x86 only*. IDK something like
> > "For x86, the test will adjust the PC for each write fault, allowing
> > the above loop to complete. Other architectures will get stuck, so the
> > #3 sync step is skipped."
>
> Ya. Untested, but how about this?

LGTM! (I haven't tested it either.)

>
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index 2d66c2724336..29acb22ea387 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -38,11 +38,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>  	 * looping until the memory is guaranteed to be read-only, otherwise
>  	 * vCPUs may complete their writes and advance to the next stage
>  	 * prematurely.
> +	 *
> +	 * For architectures that support skipping the faulting instruction,
> +	 * generate the store via inline assembly to ensure the exact length
> +	 * of the instruction is known and stable (vcpu_arch_put_guest() on
> +	 * fixed-length architectures should work, but the cost of paranoia
> +	 * is low in this case). For x86, hand-code the exact opcode so that
> +	 * there is no room for variability in the generated instruction.
>  	 */
>  	do {
>  		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
>  #ifdef __x86_64__
> -			asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
> +			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */

FWIW I much prefer the trailing comment you have ended up with vs. the one you
had before. (To me, the older one _seems_ like it's Intel syntax, in which case
the comment says it's a load..? The comment you have now is, to me, obviously
indicating a store. Though... perhaps "movq"?)

>  #elif defined(__aarch64__)
>  			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
>  #else
> @@ -163,7 +170,7 @@ static void *vcpu_worker(void *data)
>  		TEST_ASSERT_EQ(errno, EFAULT);
>  #ifdef __x86_64__
>  		WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
> -		vcpu->run->s.regs.regs.rip += 4;
> +		vcpu->run->s.regs.regs.rip += 3;
>  #endif
>  #ifdef __aarch64__
>  		vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),
>