On Tue, 2022-11-22 at 18:39 +0000, Sean Christopherson wrote:
> On Sat, Nov 19, 2022, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > 
> > The guest runstate area can be arbitrarily byte-aligned. In fact, even
> > when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> > fields in the structure end up being unaligned due to the fact that the
> > 32-bit ABI only aligns them to 32 bits.
> > 
> > So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> > is buggy, because if it's unaligned then we can't update the whole field
> > atomically; the low bytes might be observable before the _UPDATE bit is.
> > Xen actually updates the *byte* containing that top bit, on its own. KVM
> > should do the same.
> 
> I think we're using the wrong APIs to update the runstate. The VMCS/VMCB pages
> _need_ the host pfn, i.e. need to use a gpc (eventually). The Xen PV stuff on the
> other hand most definitely doesn't need to know the pfn.
> 
> The event channel code would be difficult to convert due to lack of uaccess
> primitives, but I don't see anything in the runstate code that prevents KVM from
> using a gfn_to_hva_cache. That will naturally handle page splits by sending them
> down a slow path and would yield far simpler code.
> 
> If taking the slow path is an issue, then the guest really should be fixed to not
> split pages. And if that's not an acceptable answer, the gfn_to_hva_cache code
> could be updated to use the fast path if the region is contiguous in the host
> virtual address space.

Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
thing I tried. There are a handful of complexifying factors, none of
which are insurmountable if we try hard enough.

 • Even if we fix the gfn_to_hva_cache to still use the fast path for
   more than the first page, it's still possible for the runstate to
   cross from one memslot to an adjacent one. We probably still need
   two of them, which is a large part of the ugliness of this patch.

 • Accessing it via the userspace HVA requires coping with the case
   where the vCPU is being scheduled out, and it can't sleep. The
   previous code before commit 34d41d19e dealt with that by using
   pagefault_disable() and depending on the GHC fast path. But even
   so...

 • We also could end up having to touch page B, page A then page B
   again, potentially having to abort and leave the runstate with the
   XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the version
   which checks that both pages are present before it starts writing
   anything to the guest.

As I said, they're not insurmountable but that's why I ended up with
the patch you see, after first attempting to use a gfn_to_hva_cache
again. Happy to entertain suggestions.

I note we have a kvm_read_guest_atomic() and briefly pondered adding a
'write' version of same... but that doesn't seem to cope with crossing
memslots either; it also assumes its caller only uses it to access
within a single page.
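
If we did grow one, I'd expect it to be little more than the sketch
below. To be clear, kvm_write_guest_atomic() doesn't exist today; this
is only an untested outline modelled on the read side, assuming that
gfn_to_hva_memslot(), __copy_to_user_inatomic() and
mark_page_dirty_in_slot() are the right building blocks, and it would
inherit exactly the same single-memslot, single-page limitations:

/*
 * Hypothetical write-side counterpart to kvm_read_guest_atomic(), for
 * virt/kvm/kvm_main.c. Like the read side, it only works within a
 * single memslot, and the caller must not cross a page boundary
 * because the uaccess is done against a single HVA mapping.
 */
static int kvm_write_guest_atomic(struct kvm *kvm, gpa_t gpa,
				  const void *data, unsigned long len)
{
	gfn_t gfn = gpa >> PAGE_SHIFT;
	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
	int offset = offset_in_page(gpa);
	unsigned long addr;
	int r;

	/* Yields an error HVA if the gfn has no (valid) memslot. */
	addr = gfn_to_hva_memslot(slot, gfn);
	if (kvm_is_error_hva(addr))
		return -EFAULT;

	/* Can't sleep here; the vCPU may be being scheduled out. */
	pagefault_disable();
	r = __copy_to_user_inatomic((void __user *)addr + offset, data, len);
	pagefault_enable();
	if (r)
		return -EFAULT;

	mark_page_dirty_in_slot(kvm, slot, gfn);
	return 0;
}

That would be enough for poking the single byte containing the
XEN_RUNSTATE_UPDATE bit, but it doesn't help at all with the runstate
area straddling a page (or memslot) boundary, which is the case this
patch is really about.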

> > +	*rs_state = vx->current_runstate;
> > +#ifdef CONFIG_X86_64
> > +	/* Don't leak kernel memory through the padding in the 64-bit struct */
> > +	if (rs_state == &rs.state)
> > +		rs_state[1] = 0;
> 
> Oof, that's difficult to follow. Rather than pointer magic, what about zeroing
> the first word unconditionally? Likely faster than a CMP+CMOV or whatever gets
> generated. Or just zero the entire struct.
> 
> 	/* Blah blah blah */
> 	*(unsigned long *)&rs.state = 0;

That gives us pointer aliasing problems, which I hate even if we *do*
compile with -fno-strict-aliasing. I'll use a memset which ought to end
up being precisely the same in practice.

#ifdef CONFIG_X86_64
	/*
	 * Don't leak kernel memory through the padding in the 64-bit
	 * struct if we take the split-page path.
	 */
	memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
#endif
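
For the avoidance of doubt, the struct in question (from the public Xen
vcpu.h; the offset annotations are just the usual x86_64 alignment
rules, added here for illustration) looks like this:

	struct vcpu_runstate_info {
		int      state;            /* offset 0, 4 bytes           */
		                           /* 4 bytes of padding (64-bit) */
		uint64_t state_entry_time; /* offset 8 on 64-bit          */
		uint64_t time[4];
	};

So offsetof(struct vcpu_runstate_info, state_entry_time) is 8 there,
and the memset() zeroes exactly the state word plus the padding before
state_entry_time, which is what the rs_state[1] = 0 pointer trick was
doing by hand.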