On Fri, Jun 24, 2022 at 04:54:26PM +0800, Chao Peng wrote:
> On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote:
> > On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
> > > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > > The alternative would be to have some kind of separate table or
> > > > bitmap (part of the memslot?) that tells KVM whether a GPA should
> > > > map to the fd.
> > > >
> > > > What do you all think?
> > >
> > > My original proposal was to have explicit shared vs. private
> > > memslots, and punch holes in KVM's memslots on conversion, but due
> > > to the way KVM (and userspace) handle memslot updates, conversions
> > > would be painfully slow. That's how we ended up with the current
> > > proposal.
> > >
> > > But a dedicated KVM ioctl() to add/remove shared ranges would be
> > > easy to implement and wouldn't necessarily even need to interact
> > > with the memslots. It could be a consumer of memslots, e.g. if we
> > > wanted to disallow registering regions without an associated
> > > memslot, but I think we'd want to avoid even that because things
> > > will get messy during memslot updates, e.g. if dirty logging is
> > > toggled or a shared memory region is temporarily removed then we
> > > wouldn't want to destroy the tracking.
> > >
> > > I don't think we'd want to use a bitmap, e.g. for a well-behaved
> > > guest, an XArray should be far more efficient.
> > >
> > > One benefit to explicitly tracking this in KVM is that it might be
> > > useful for software-only protected VMs, e.g. KVM could mark a
> > > region in the XArray as "pending" based on guest hypercalls to
> > > share/unshare memory, and then complete the transaction when
> > > userspace invokes the ioctl() to complete the share/unshare.
> >
> > Another upside to implementing a KVM ioctl is basically the reverse
> > of the discussion around avoiding double-allocations: *supporting*
> > double-allocations.
> >
> > One thing I noticed while testing SNP+UPM support is a fairly
> > dramatic slow-down in how it handles OVMF, which does some really
> > nasty stuff with DMA where it takes 1 or 2 pages and flips them
> > between shared/private on every transaction. Obviously that's not
> > ideal and should be fixed directly at some point, but it's something
> > that exists in the wild and might not be the only such instance where
> > we need to deal with that sort of usage pattern.
> >
> > With the current implementation, one option I had to address this was
> > to disable hole-punching in QEMU when doing shared->private
> > conversions:
> >
> > Boot time for a 1GB guest:
> >   SNP:                              32s
> >   SNP+UPM:                          1m43s
> >   SNP+UPM (disable shared discard): 1m08s
> >
> > Of course, we don't have the option of disabling
> > discard/hole-punching for private memory to see if we get similar
> > gains there, since that also doubles as the interface for doing
> > private->shared conversions.
>
> Private should be the same; minus the time consumed for private memory,
> the data should be close to the SNP case. You can't try that in the
> current version because we rely on the existence of the private page to
> tell whether a page is private.
>
> > A separate KVM ioctl to decouple these 2 things would allow for that,
> > and allow for a way for userspace to implement things like
> > batched/lazy discard of previously-converted pages to deal with cases
> > like these.
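
(To illustrate what I mean by batched/lazy discard: rather than the VMM
punching a hole on every private->shared conversion, it could queue the
discards and flush them in one pass. Rough, untested sketch; the
threshold and all struct/function names here are made up:)

  /*
   * Untested sketch, purely illustrative: queue discards of
   * previously-converted ranges and punch the holes in one batch
   * instead of on every conversion.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/types.h>
  #include <linux/falloc.h>

  #define MAX_PENDING_DISCARDS 64

  struct pending_discard {
          off_t offset;
          off_t len;
  };

  static struct pending_discard pending[MAX_PENDING_DISCARDS];
  static int nr_pending;

  /* Punch all queued holes in one pass, e.g. at a quiescent point. */
  static int discard_flush(int memfd)
  {
          int i, ret;

          for (i = 0; i < nr_pending; i++) {
                  ret = fallocate(memfd,
                                  FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                  pending[i].offset, pending[i].len);
                  if (ret)
                          return ret;
          }
          nr_pending = 0;
          return 0;
  }

  /* Queue a discard instead of punching a hole immediately. */
  static int discard_queue(int memfd, off_t offset, off_t len)
  {
          pending[nr_pending].offset = offset;
          pending[nr_pending].len = len;

          if (++nr_pending == MAX_PENDING_DISCARDS)
                  return discard_flush(memfd);
          return 0;
  }

The flush point could be whatever makes sense for the VMM, e.g. a
count threshold as above, a timer, or a natural quiescent point.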
>
> The planned ioctl includes two responsibilities:
>   - Mark the range as private/shared
>   - Zap the existing SLPT mapping for the range
>
> Whether or not to do the hole-punching on the fd is unrelated to this
> ioctl; userspace is free to do that or not. Since we don't rely on the
> fact that private memory has already been allocated, we can support
> lazy faulting and don't need an explicit fallocate(). That means
> whether the memory is discarded or not in the memory backing store is
> not something KVM requires; it's a userspace option.

Nice, that sounds promising.

>
> > Another motivator for these separate ioctls is that, since we're
> > considering 'out-of-band' interactions with the private memfd where
> > userspace might erroneously/inadvertently do things like double
> > allocations, another thing it might do is pre-allocate pages in the
> > private memfd prior to associating the memfd with a private memslot.
> > Since the notifiers aren't registered until that point, any
> > associated callbacks that would normally need to be done as part of
> > those fallocate() notifications would be missed unless we do
> > something like 'replay' all the notifications once the private
> > memslot is registered and associated with a memfile notifier. But
> > that seems a bit ugly, and I'm not sure how well that would work.
> > This also seems to hint at this additional 'conversion' state being
> > something that should be owned and managed directly by KVM rather
> > than by hooking into the allocations.
>
> Right, once we move the private/shared state into KVM then we don't
> rely on those callbacks, so the 'replay' thing is unneeded. The
> fallocate() notification is useless for sure; invalidate() is likely
> still needed, just like the mmu_notifier invalidate that bumps the
> mmu_seq and does the zap.

Ok, yah, makes sense that we'd still end up needing the invalidation
hooks.

>
> > It would also nicely solve the question of how to handle in-place
> > encryption, since unlike userspace, KVM is perfectly capable of
> > copying data from shared->private prior to conversion / guest start,
> > and disallowing such things afterward. Would just need an extra flag
> > basically.
>
> Agreed, it's possible to do an additional copy during the conversion,
> but I'm not so confident this is urgent or the right API. Currently
> TDX does not have this need. Maybe as the first step just add the
> conversion itself; additional features like this can always be added
> later, whenever we are clear on them.

That seems fair. In the meantime we can adopt the approach proposed by
Sean and Vishal[1] and handle it directly in the relevant SNP KVM
ioctls. If we end up keeping that approach we'll probably want to make
sure these KVM-driven 'implicit' conversions are documented in the
KVM/SNP API so that userspace can account for them in its view of
what's private/shared. In this case at least it's pretty obvious; I'm
just thinking ahead to when other archs and VMMs utilize this more.

Thanks!

-Mike

[1] https://lore.kernel.org/kvm/20220524205646.1798325-4-vannapurve@xxxxxxxxxx/T/#m1e9bb782b1bea66c36ae7c4c9f4f0c35c2d7e338

> Thanks,
> Chao
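
P.S. To make sure I'm picturing the same division of labor for the two
responsibilities above, here's a rough, completely untested sketch of
what the conversion ioctl could look like on the KVM side. The handler
name, the kvm_memory_attr struct, the KVM_MEM_ATTR_* flags, and the
mem_attr_array xarray are all invented for illustration, and
kvm_zap_gfn_range() stands in for whatever arch hook actually performs
the zap:

  /*
   * Untested sketch of the planned conversion ioctl: record the new
   * state for the range in an xarray, then zap any existing SLPT
   * mappings so the next fault maps it according to the new state.
   */
  #define KVM_MEM_ATTR_SHARED  0x0001
  #define KVM_MEM_ATTR_PRIVATE 0x0002

  struct kvm_memory_attr {
          __u64 gpa;
          __u64 size;     /* in bytes, page-aligned */
          __u32 attr;     /* KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE */
  };

  static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm,
                                       struct kvm_memory_attr *attr)
  {
          gfn_t start = gpa_to_gfn(attr->gpa);
          gfn_t end = gpa_to_gfn(attr->gpa + attr->size);
          void *entry;
          gfn_t gfn;
          int r;

          /* Responsibility #1: mark the range as private/shared. */
          entry = attr->attr == KVM_MEM_ATTR_PRIVATE ?
                  xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
          for (gfn = start; gfn < end; gfn++) {
                  r = xa_err(xa_store(&kvm->mem_attr_array, gfn, entry,
                                      GFP_KERNEL_ACCOUNT));
                  if (r)
                          return r;
          }

          /* Responsibility #2: zap the existing SLPT mappings. */
          kvm_zap_gfn_range(kvm, start, end);
          return 0;
  }

An XArray keyed by gfn like this would also leave room for the
"pending" state Sean mentioned for software-only protected VMs.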