On Mon, May 23, 2022 at 03:22:32PM +0000, Sean Christopherson wrote: > On Mon, May 23, 2022, Chao Peng wrote: > > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote: > > > On Fri, May 20, 2022, Andy Lutomirski wrote: > > > > The alternative would be to have some kind of separate table or bitmap (part > > > > of the memslot?) that tells KVM whether a GPA should map to the fd. > > > > > > > > What do you all think? > > > > > > My original proposal was to have expolicit shared vs. private memslots, and punch > > > holes in KVM's memslots on conversion, but due to the way KVM (and userspace) > > > handle memslot updates, conversions would be painfully slow. That's how we ended > > > up with the current propsoal. > > > > > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement > > > and wouldn't necessarily even need to interact with the memslots. It could be a > > > consumer of memslots, e.g. if we wanted to disallow registering regions without an > > > associated memslot, but I think we'd want to avoid even that because things will > > > get messy during memslot updates, e.g. if dirty logging is toggled or a shared > > > memory region is temporarily removed then we wouldn't want to destroy the tracking. > > > > Even we don't tight that to memslots, that info can only be effective > > for private memslot, right? Setting this ioctl to memory ranges defined > > in a traditional non-private memslots just makes no sense, I guess we can > > comment that in the API document. > > Hrm, applying it universally would be funky, e.g. emulated MMIO would need to be > declared "shared". But, applying it selectively would arguably be worse, e.g. > letting userspace map memory into the guest as shared for a region that's registered > as private... > > On option to that mess would be to make memory shared by default, and so userspace > must declare regions that are private. Then there's no weirdness with emulated MMIO > or "legacy" memslots. > > On page fault, KVM does a lookup to see if the GPA is shared or private. If the > GPA is private, but there is no memslot or the memslot doesn't have a private fd, > KVM exits to userspace. If there's a memslot with a private fd, the shared/private > flag is used to resolve the > > And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump the > notifier sequence, i.e. force the page fault to retry if the GPA may have been > (un)registered between checking the type and acquiring mmu_lock. Yeah, that makes sense. > > > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray > > > should be far more efficient. > > > > What about the mis-behaved guest? I don't want to design for the worst > > case, but people may raise concern on the attack from such guest. > > That's why cgroups exist. E.g. a malicious/broken L1 can similarly abuse nested > EPT/NPT to generate a large number of shadow page tables. I havn't seen we had that in KVM. Is there any plan/discussion to add that? > > > > One benefit to explicitly tracking this in KVM is that it might be useful for > > > software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" > > > based on guest hypercalls to share/unshare memory, and then complete the transaction > > > when userspace invokes the ioctl() to complete the share/unshare. > > > > OK, then this can be another field of states/flags/attributes. Let me > > dig up certain level of details: > > > > First, introduce below KVM ioctl > > > > KVM_SET_MEMORY_ATTR > > Actually, if the semantics are that userspace declares memory as private, then we > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION. It'd > be a little gross because we'd need to slightly redefine the semantics for TDX, SNP, > and software-protected VM types, e.g. the ioctls() currently require a pre-exisitng > memslot. But I think it'd work... These existing ioctls looks good for TDX and probably SNP as well. For softrware-protected VM types, it may not be enough. Maybe for the first step we can reuse this for all hardware based solutions and invent new interface when software-protected solution gets really supported. There is semantics difference for fd-based private memory. Current above two ioctls() use userspace addreess(hva) while for fd-based it should be fd+offset, and probably it's better to use gpa in this case. Then we will need change existing semantics and break backward-compatibility. Chao > > I'll think more on this...