On Mon, Mar 08, 2021 at 03:11:41PM -0600, Brijesh Singh wrote:
>
> On 3/8/21 1:51 PM, Sean Christopherson wrote:
> > On Mon, Mar 08, 2021, Ashish Kalra wrote:
> >> On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote:
> >>> +Will and Quentin (arm64)
> >>>
> >>> Moving the non-KVM x86 folks to bcc, I don't think they care about KVM
> >>> details at this point.
> >>>
> >>> On Fri, Feb 26, 2021, Ashish Kalra wrote:
> >>>> On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote:
> >>>>> On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra <ashish.kalra@xxxxxxx> wrote:
> >>>>> Thanks for grabbing the data!
> >>>>>
> >>>>> I am fine with both paths. Sean has stated an explicit desire for
> >>>>> hypercall exiting, so I think that would be the current consensus.
> >>>
> >>> Yep, though it'd be good to get Paolo's input, too.
> >>>
> >>>>> If we want to do hypercall exiting, this should be in a follow-up
> >>>>> series where we implement something more generic, e.g. a hypercall
> >>>>> exiting bitmap or hypercall exit list. If we are taking the hypercall
> >>>>> exit route, we can drop the kvm side of the hypercall.
> >>>
> >>> I don't think this is a good candidate for arbitrary hypercall
> >>> interception. Or rather, I think hypercall interception should be an
> >>> orthogonal implementation.
> >>>
> >>> The guest, including guest firmware, needs to be aware that the
> >>> hypercall is supported, and the ABI needs to be well-defined. Relying
> >>> on userspace VMMs to implement a common ABI is an unnecessary risk.
> >>>
> >>> We could make KVM's default behavior be a nop, i.e. have KVM enforce
> >>> the ABI but require further VMM intervention. But I just don't see the
> >>> point; it would save only a few lines of code. It would also limit what
> >>> KVM could do in the future, e.g. if KVM wanted to do its own
> >>> bookkeeping _and_ exit to userspace, then mandatory interception would
> >>> essentially make it impossible for KVM to do bookkeeping while still
> >>> honoring the interception request.
> >>>
> >>> However, I do think it would make sense to have the userspace exit be
> >>> a generic exit type. But hey, we already have the necessary ABI
> >>> defined for that! It's just not used anywhere.
> >>>
> >>>         /* KVM_EXIT_HYPERCALL */
> >>>         struct {
> >>>                 __u64 nr;
> >>>                 __u64 args[6];
> >>>                 __u64 ret;
> >>>                 __u32 longmode;
> >>>                 __u32 pad;
> >>>         } hypercall;
> >>>
> >>>>> Userspace could also handle the MSR using MSR filters (would need to
> >>>>> confirm that). Then userspace could also be in control of the cpuid bit.
> >>>
> >>> An MSR is not a great fit; it's x86 specific and limited to 64 bits of
> >>> data. The data limitation could be fudged by shoving data into
> >>> non-standard GPRs, but that will result in truly heinous guest code
> >>> and extensibility issues.
> >>>
> >>> The data limitation is a moot point, because the x86-only thing is a
> >>> deal breaker. arm64's pKVM work has a near-identical use case for a
> >>> guest to share memory with a host. I can't think of a clever way to
> >>> avoid having to support TDX's and SNP's hypervisor-agnostic variants,
> >>> but we can at least not have multiple KVM variants.
> >>>
> >> Potentially, there is another reason for in-kernel hypercall handling
> >> considering SEV-SNP. In the case of SEV-SNP, the RMP table tracks the
> >> state of each guest page, for instance pages in hypervisor state, i.e.
> >> pages with C=0, and pages in guest valid state with C=1.
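[Editorial aside: as a minimal sketch of how a userspace VMM could consume
the KVM_EXIT_HYPERCALL layout quoted above for page encryption status
updates. The hypercall number and the vmm_mark_page_enc() helper are
hypothetical placeholders, not an established ABI.]

        /*
         * Illustrative only: userspace handling of a KVM_EXIT_HYPERCALL
         * exit that carries a page encryption status update.
         */
        #include <stdbool.h>
        #include <stdint.h>
        #include <linux/kvm.h>

        #define HC_PAGE_ENC_STATUS 0x100   /* placeholder, not a defined ABI */

        extern void vmm_mark_page_enc(uint64_t gfn, uint64_t npages, bool enc);

        static void handle_hypercall_exit(struct kvm_run *run)
        {
                /* Caller already checked run->exit_reason == KVM_EXIT_HYPERCALL. */
                if (run->hypercall.nr != HC_PAGE_ENC_STATUS) {
                        run->hypercall.ret = (uint64_t)-1;   /* unsupported */
                        return;
                }

                uint64_t gpa    = run->hypercall.args[0];
                uint64_t npages = run->hypercall.args[1];
                bool enc        = run->hypercall.args[2];

                /* Record shared (C=0) vs. private (C=1) state in a VMM-owned bitmap. */
                vmm_mark_page_enc(gpa >> 12, npages, enc);

                run->hypercall.ret = 0;
        }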
> >>
> >> Now, there shouldn't be a need for page encryption status hypercalls
> >> on SEV-SNP as KVM can track & reference guest page status directly
> >> using the RMP table.
> >
> > Relying on the RMP table itself would require locking the RMP table
> > for an extended duration, and walking the entire RMP to find shared
> > pages would be very inefficient.
> >
> >> As KVM maintains the RMP table, we will therefore need SET/GET type
> >> interfaces to provide the guest page encryption status to userspace.
> >
> > Hrm, somehow I temporarily forgot about SNP and TDX adding their own
> > hypercalls for converting between shared and private. And in the case
> > of TDX, the hypercall can't be trusted, i.e. it is just a hint;
> > otherwise the guest could induce a #MC in the host.
> >
> > But the different guest behavior doesn't require KVM to maintain a
> > list/tree, e.g. adding a dedicated KVM_EXIT_* for notifying userspace
> > of page encryption status changes would also suffice.
> >
> > Actually, that made me think of another argument against maintaining a
> > list in KVM: there's no way to notify userspace that a page's status
> > has changed. Userspace would need to query KVM, i.e. do GET_LIST after
> > every GET_DIRTY. Obviously not a huge issue, but it does make
> > migration slightly less efficient.
> >
> > On a related topic, there are fatal race conditions that will require
> > careful coordination between guest and host, and will effectively be
> > wired into the ABI. SNP and TDX don't suffer these issues because host
> > awareness of status is atomic with respect to the guest actually
> > writing the page with the new encryption status.
> >
> > For SEV live migration...
> >
> > If the guest does the hypercall after writing the page, then the guest
> > is hosed if it gets migrated while writing the page (scenario #1):
> >
> >   vCPU                   Userspace
> >   zero_bytes[0:N]
> >                          <transfers written bytes as private instead of shared>
> >                          <migrates vCPU>
> >   zero_bytes[N+1:4095]
> >   set_shared (dest)
> >   kaboom!
>
> Maybe I am missing something; this is not any different from a normal
> operation inside a guest. Making a page shared/private in the page table
> does not update the content of the page itself. In your above case, I
> assume zero_bytes[N+1:4095] are written by the destination VM. The
> memory region was private in the source VM page table, so those writes
> will be performed encrypted. The destination VM later changed the memory
> to shared, but nobody wrote to the memory after it was transitioned to
> shared, so a reader of the memory should get ciphertext unless there was
> a write after the set_shared (dest).
>
> > If userspace does GET_DIRTY after GET_LIST, then the host would
> > transfer bad data by consuming a stale list (scenario #2):
> >
> >   vCPU                   Userspace
> >                          get_list (from KVM or internally)
> >   set_shared (src)
> >   zero_page (src)
> >                          get_dirty
> >                          <transfers private data instead of shared>
> >                          <migrates vCPU>
> >   kaboom!
>
> I don't remember how things are done in Ashish's recent Qemu/KVM
> patches, but in the previous series the get_dirty() happens before
> querying the encrypted state. There was some logic in the VMM to resync
> the encrypted bitmap during the final migration stage and perform any
> additional data transfer since the last sync.
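[Editorial aside: to make the ordering Brijesh describes concrete, a rough
sketch of one migration iteration on the source side. The vmm_* and send_*
helpers are hypothetical stand-ins, not the actual Qemu code; the point is
that the dirty log is fetched before the encryption-status bitmap, so the
status used for a transfer is never staler than the dirty information
(avoiding scenario #2).]

        #include <stdbool.h>
        #include <stdint.h>

        extern void vmm_get_dirty_log(unsigned long *dirty, uint64_t npages);
        extern void vmm_get_page_enc_bitmap(unsigned long *enc, uint64_t npages);
        extern void send_private_page(uint64_t pfn);   /* e.g. SEV SEND_UPDATE_DATA path */
        extern void send_shared_page(uint64_t pfn);    /* plain copy path */

        static bool test_bit_ul(const unsigned long *map, uint64_t bit)
        {
                return (map[bit / (8 * sizeof(long))] >> (bit % (8 * sizeof(long)))) & 1;
        }

        static void migrate_iteration(unsigned long *dirty, unsigned long *enc,
                                      uint64_t npages)
        {
                vmm_get_dirty_log(dirty, npages);       /* 1. dirty pages first   */
                vmm_get_page_enc_bitmap(enc, npages);   /* 2. then enc status     */

                for (uint64_t pfn = 0; pfn < npages; pfn++) {
                        if (!test_bit_ul(dirty, pfn))
                                continue;
                        if (test_bit_ul(enc, pfn))
                                send_private_page(pfn); /* C=1: transfer ciphertext */
                        else
                                send_shared_page(pfn);  /* C=0: transfer plaintext  */
                }
        }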
>

Yes, we do that, and in fact we added logic in the VMM to resync the
encrypted bitmap after every migration iteration; if there is a
difference in encrypted page states, we perform additional data
transfers corresponding to those changes.

Thanks,
Ashish
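[Editorial aside: an illustration of the resync step Ashish describes,
reusing the hypothetical helpers from the sketch above; this is not the
actual Qemu/KVM patch code.]

        #include <stdint.h>
        #include <string.h>

        static void resync_enc_bitmap(unsigned long *old_enc, unsigned long *new_enc,
                                      uint64_t npages)
        {
                /* Re-fetch the current encryption-status bitmap. */
                vmm_get_page_enc_bitmap(new_enc, npages);

                for (uint64_t pfn = 0; pfn < npages; pfn++) {
                        if (test_bit_ul(old_enc, pfn) == test_bit_ul(new_enc, pfn))
                                continue;

                        /*
                         * Status flipped since the last sync: the copy already
                         * sent is stale, so transfer the page again in its
                         * current form.
                         */
                        if (test_bit_ul(new_enc, pfn))
                                send_private_page(pfn);
                        else
                                send_shared_page(pfn);
                }

                /* Remember the state we just synced against. */
                memcpy(old_enc, new_enc, (npages + 7) / 8);
        }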