On Mon, Oct 24, 2022, Alexander Graf wrote:
> Hey Sean,
>
> On 21.10.22 21:40, Sean Christopherson wrote:
> >
> > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > On 20.10.22 22:37, Sean Christopherson wrote:
> > > > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > > > On 26.06.20 19:32, Sean Christopherson wrote:
> > > > > > /cast <thread necromancy>
> > > > > >
> > > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> > > > > [...]
> > > > >
> > > > > > I don't think any of this explains the pass-through GPU issue.  But, we
> > > > > > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > > > > > going to retry upstreaming this patch with a per-VM opt-in.  I wanted to
> > > > > > set the record straight for posterity before doing so.
> > > > > Hey Sean,
> > > > >
> > > > > Did you ever get around to upstreaming or reworking the zap optimization?
> > > > > The way I read current upstream, a memslot change still always wipes all
> > > > > SPTEs, not only the ones that were changed.
> > > >
> > > > Nope, I've more or less given up hope on zapping only the deleted/moved
> > > > memslot.  TDX (and SNP?) will preserve SPTEs for guest private memory, but
> > > > they're very much a special case.
> > > >
> > > > Do you have a use case and/or issue that doesn't play nice with the "zap
> > > > all" behavior?
> > > Yeah, we're looking at adding support for the Hyper-V VSM extensions which
> > > Windows uses to implement Credential Guard.  With that, the guest gets access
> > > to hypercalls that allow it to set reduced permissions for arbitrary gfns.
> > > To ensure that user space has full visibility into those for live migration,
> > > memory slots to model access would be a great fit.  But it means we'd do
> > > ~100k memslot modifications on boot.
> > Oof.  100k memslot updates is going to be painful irrespective of flushing.
> > And memslots (in their current form) won't work if the guest can drop
> > executable permissions.
> >
> > Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to
> > solve the "KVM flushes everything on memslot deletion", I think we should
> > instead properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC)
> > without forcing userspace to delete the memslot.  Commit 75d61fbcf563
> > ("KVM: set_memory_region:
> >
> That would be a cute acceleration for the case where we have to change
> permissions for a full slot. Unfortunately, the bulk of the changes are slot
> splits.

Ah, right, the guest will be operating on per-page granularity.

> We already built a prototype implementation of an atomic memslot update
> ioctl that allows us to keep other vCPUs running while we do the
> delete/create/create/create operation.

Please weigh in with your use case on a relevant upstream discussion regarding
"atomic" memslot updates[*].  I suspect we'll end up with a different solution
for this use case (see below), but we should at least capture all potential
use cases and ideas for modifying memslots without pausing vCPUs.

[*] https://lore.kernel.org/all/20220909104506.738478-1-eesposit@xxxxxxxxxx

> But even with that, we see up to 30 min boot times for larger guests that
> most of the time are stuck in zapping pages.

Out of curiosity, did you measure runtime performance?  I would expect some
amount of runtime overhead as well due to fragmenting memslots to that degree.
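To spell out the fragmentation for onlookers: with today's uAPI, reducing
permissions on a single page in the middle of a slot means deleting the
covering slot and re-creating it as (up to) three slots, i.e. the
delete/create/create/create sequence mentioned above.  Rough sketch only; the
helper name and slot numbering are made up, error handling is omitted, and it
assumes the page sits strictly inside the slot:

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Illustrative only: make the 4KiB page at 'offset' bytes into an existing
   * slot read-only by splitting the slot into three. */
  static void make_page_readonly(int vm_fd,
                                 struct kvm_userspace_memory_region old,
                                 __u64 offset)
  {
          struct kvm_userspace_memory_region r = old;

          r.memory_size = 0;                      /* size == 0 deletes the slot */
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

          r = old;                                /* low part, original flags */
          r.memory_size = offset;
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

          r = old;                                /* the one affected page */
          r.slot             = old.slot + 1;      /* needs a free slot id */
          r.flags           |= KVM_MEM_READONLY;
          r.guest_phys_addr += offset;
          r.userspace_addr  += offset;
          r.memory_size      = 0x1000;
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

          r = old;                                /* high part, original flags */
          r.slot             = old.slot + 2;      /* another free slot id */
          r.guest_phys_addr += offset + 0x1000;
          r.userspace_addr  += offset + 0x1000;
          r.memory_size      = old.memory_size - offset - 0x1000;
          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);
  }

Do that ~100k times and the slot count (and lookup cost) grows accordingly,
which is why I'd expect overhead at runtime and not just at boot.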
> I guess we have 2 options to make this viable:
>
> 1) Optimize memslot splits + modifications to a point where they're fast
> enough
> 2) Add a different, faster mechanism on top of memslots for page granular
> permission bits

#2 crossed my mind as well.  This is actually nearly identical to the
confidential VM use case, where KVM needs to handle guest-initiated conversions
of memory between "private" and "shared" on a per-page granularity.  The
proposed solution for that is indeed a layer on top of memslots[*], which we
arrived at in no small part because splitting memslots was going to be a
bottleneck.

Extending the proposed mem_attr_array to support additional state should be
quite easy.  The framework is all there, KVM just needs a few extra flag
values, e.g.

  KVM_MEM_ATTR_SHARED    BIT(0)
  KVM_MEM_ATTR_READONLY  BIT(1)
  KVM_MEM_ATTR_NOEXEC    BIT(2)

and then new ioctls to expose the functionality to userspace.

Actually, if we want to go this route, it might even make sense to define a
new generic MEM_ATTR ioctl() right away (rough strawman below) instead of
repurposing KVM_MEMORY_ENCRYPT_(UN)REG_REGION for the private vs. shared use
case.

[*] https://lore.kernel.org/all/20220915142913.2213336-6-chao.p.peng@xxxxxxxxxxxxxxx
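Strawman of what such a generic uAPI could look like; everything below (struct
layout, field names, ioctl name and number) is made up for discussion and is
not existing uAPI:

  /* The flag values from above, plus a hypothetical ioctl to apply them to a
   * gfn range; all names and the ioctl number are placeholders. */
  #define KVM_MEM_ATTR_SHARED    BIT(0)
  #define KVM_MEM_ATTR_READONLY  BIT(1)
  #define KVM_MEM_ATTR_NOEXEC    BIT(2)

  struct kvm_mem_attr {
          __u64 address;        /* start of the range, byte address, page aligned */
          __u64 size;           /* size of the range in bytes, page aligned */
          __u64 attributes;     /* KVM_MEM_ATTR_* values to apply */
          __u64 flags;          /* reserved for future use, must be zero */
  };

  #define KVM_SET_MEM_ATTR      _IOW(KVMIO, 0xd2, struct kvm_mem_attr)

VSM's per-gfn permission changes would then be one ioctl() per affected range
instead of a memslot split, with the attributes consumed when installing SPTEs.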