On Tue, Nov 22, 2022, David Woodhouse wrote:
> On Tue, 2022-11-22 at 18:39 +0000, Sean Christopherson wrote:
> > On Sat, Nov 19, 2022, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > > 
> > > The guest runstate area can be arbitrarily byte-aligned. In fact, even
> > > when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> > > fields in the structure end up being unaligned due to the fact that the
> > > 32-bit ABI only aligns them to 32 bits.
> > > 
> > > So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> > > is buggy, because if it's unaligned then we can't update the whole field
> > > atomically; the low bytes might be observable before the _UPDATE bit is.
> > > Xen actually updates the *byte* containing that top bit, on its own. KVM
> > > should do the same.
> > 
> > I think we're using the wrong APIs to update the runstate.  The VMCS/VMCB
> > pages _need_ the host pfn, i.e. need to use a gpc (eventually).  The Xen PV
> > stuff on the other hand most definitely doesn't need to know the pfn.
> > 
> > The event channel code would be difficult to convert due to lack of uaccess
> > primitives, but I don't see anything in the runstate code that prevents KVM
> > from using a gfn_to_hva_cache.  That will naturally handle page splits by
> > sending them down a slow path and would yield far simpler code.
> > 
> > If taking the slow path is an issue, then the guest really should be fixed
> > to not split pages.  And if that's not an acceptable answer, the
> > gfn_to_hva_cache code could be updated to use the fast path if the region
> > is contiguous in the host virtual address space.
> 
> Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
> thing I tried. There are a handful of complexifying factors, none of
> which are insurmountable if we try hard enough.
> 
>  • Even if we fix the gfn_to_hva_cache to still use the fast path for
>    more than the first page, it's still possible for the runstate to
>    cross from one memslot to an adjacent one. We probably still need
>    two of them, which is a large part of the ugliness of this patch.

Hrm.  What if KVM requires that the runstate be contiguous in host virtual
address space?  That wouldn't violate Xen's ABI since it's a KVM restriction,
and it can't break backwards compatibility since KVM doesn't even handle page
splits, let alone memslot splits.  AFAIK, most (all?) userspace VMMs make
guest RAM virtually contiguous anyways.

Probably a moot point since I don't see a way around the "page got swapped
out" issue.

>  • Accessing it via the userspace HVA requires coping with the case
>    where the vCPU is being scheduled out, and it can't sleep. The
>    previous code before commit 34d41d19e dealt with that by using
>    pagefault_disable() and depending on the GHC fast path. But even
>    so...
> 
>  • We also could end up having to touch page B, page A then page B
>    again, potentially having to abort and leave the runstate with
>    the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
>    version which checks that both pages are present before it starts
>    writing anything to the guest.

Ugh, and because of the in-atomic case, there's nothing KVM can do to remedy
page B getting swapped out.  Actually, it doesn't even require a B=>A=>B
pattern.  Even without a page split, the backing page could get swapped out
between setting and clearing.
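(Aside, purely to illustrate the alignment problem the patch description at
the top is talking about; the layout below is a sketch of the usual compat
definition, not code from this series:)

	/*
	 * With the 32-bit ABI, uint64_t only gets 4-byte alignment, so even
	 * if the guest aligns the structure itself, state_entry_time lands
	 * at offset 4:
	 */
	struct compat_vcpu_runstate_info {
		int      state;			/* offset 0 */
		uint64_t state_entry_time;	/* offset 4, not 8-byte aligned */
		uint64_t time[4];		/* offset 12 */
	} __attribute__((packed));

	/*
	 * A single 8-byte store to state_entry_time therefore isn't
	 * guaranteed to be atomic, which is why only the byte containing
	 * XEN_RUNSTATE_UPDATE (the most significant byte, i.e. the last
	 * byte on little-endian) gets written on either side of the update.
	 */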
> As I said, they're not insurmountable but that's why I ended up with
> the patch you see, after first attempting to use a gfn_to_hva_cache
> again. Happy to entertain suggestions.

What if we use a gfn_to_pfn_cache for the page containing XEN_RUNSTATE_UPDATE,
and then use kvm_vcpu_write_guest_atomic() if there's a page split?  That
would avoid needing to acquire multiple gpc locks, and the update could use a
single code flow by always constructing an on-stack representation.  E.g. very
roughly:

	*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
	smp_wmb();

	if (khva)
		memcpy(khva, rs_state, user_len);
	else
		kvm_vcpu_write_guest_in_atomic(vcpu, gpa, rs_state, user_len);
	smp_wmb();

	*update_bit = vx->runstate_entry_time >> 56;
	smp_wmb();

where "khva" is NULL if there's a page split.

> I note we have a kvm_read_guest_atomic() and briefly pondered adding a
> 'write' version of same... but that doesn't seem to cope with crossing
> memslots either; it also assumes its caller only uses it to access
> within a single page.

We should fix the lack of page split handling.  There are no bugs because KVM
only uses the helper for cases where the accesses are naturally aligned, but
that's bound to bite someone eventually.

That would be an opportunity to dedup the code that handles "segments" (ugh),
e.g. instead of

	gfn_t gfn = gpa >> PAGE_SHIFT;
	int seg;
	int offset = offset_in_page(gpa);
	int ret;

	while ((seg = next_segment(len, offset)) != 0) {
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret < 0)
			return ret;
		offset = 0;
		len -= seg;
		data += seg;
		++gfn;
	}
	return 0;

provide a macro that allows

	kvm_for_each_chunk(...) {
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret)
			return ret;
	}

I also think we should rename the "atomic" helpers to be "in_atomic" (or
inatomic if we want to follow the kernel's terrible nomenclature, which I
always read as "not atomic").  I always think read_guest_atomic() means an
"atomic read".
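Going back to the kvm_for_each_chunk() idea above, for what it's worth, one
hypothetical shape for such a macro (just a sketch of the idea, the name and
details are made up here, not an actual proposal for kvm_main.c) would be:

	/*
	 * Hypothetical sketch only: walk a guest physical range one
	 * page-bounded chunk at a time, advancing gfn/offset/data exactly
	 * like the open-coded loops above do.
	 */
	#define kvm_for_each_chunk(gfn, offset, len, data, seg)		\
		for (seg = next_segment(len, offset);			\
		     seg;						\
		     len -= seg, data += seg, offset = 0, ++gfn,	\
		     seg = next_segment(len, offset))

	kvm_for_each_chunk(gfn, offset, len, data, seg) {
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret)
			return ret;
	}

That would keep the segment-walking arithmetic in one place and leave each
caller to supply only the per-page access.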