On Wed, Jan 18, 2023 at 04:16:41PM +0800, Chao Peng <chao.p.peng@xxxxxxxxxxxxxxx> wrote:
> On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> > On Tue, Jan 17, 2023, Chao Peng wrote:
> > > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > > > +		notifier->ops->invalidate_start(notifier, start, end);
> > > >
> > > > Two major design issues that we overlooked long ago:
> > > >
> > > >  1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
> > > >     VM with a large number of convertible memslots that are all backed by a
> > > >     single large restrictedmem instance, then converting a single page will
> > > >     result in a linear walk through all memslots.  I don't expect anyone to
> > > >     actually do something silly like that, but I also never expected there to
> > > >     be a legitimate usecase for thousands of memslots.
> > > >
> > > >  2. This approach fails to provide the ability for KVM to ensure a guest has
> > > >     exclusive access to a page.  As discussed in the past, the kernel can rely
> > > >     on hardware (and maybe ARM's pKVM implementation?) for those guarantees,
> > > >     but only for SNP and TDX VMs.  For VMs where userspace is trusted to some
> > > >     extent, e.g. SEV, there is value in ensuring a 1:1 association.
> > > >
> > > >     And probably more importantly, relying on hardware for SNP and TDX yields
> > > >     a poor ABI and complicates KVM's internals.  If the kernel doesn't
> > > >     guarantee a page is exclusive to a guest, i.e. if userspace can hand out
> > > >     the same page from a restrictedmem instance to multiple VMs, then failure
> > > >     will occur only when KVM tries to assign the page to the second VM.  That
> > > >     will happen deep in KVM, which means KVM needs to gracefully handle such
> > > >     errors, and it means that KVM's ABI effectively allows plumbing garbage
> > > >     into its memslots.
> > >
> > > It may not be a valid usage, but in my TDX environment I do hit the issue
> > > below.
> > >
> > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> > >
> > > Slot#2 ('SMRAM') is actually an alias into system memory (Slot#0) in QEMU,
> > > and Slot#2 fails due to the exclusive check below.
> > >
> > > For now I changed the QEMU code to mark these alias slots as shared
> > > instead of private, but I'm not 100% confident that is the correct fix.
> >
> > That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU shouldn't
> > be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.
>
> Thanks for the confirmation.  As long as we only bind one notifier for
> each address, using an xarray does make things simple.

In the past I had patches for QEMU to disable PAM and SMRAM, but they were
dropped for simplicity: SMRAM/PAM are already disabled in the reset state
and only an unused memslot is registered for them, and the TDX guest BIOS
(TDVF or EDK2) doesn't enable them anyway.  Now we can revive those patches.

-- 
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
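
For reference, here is a minimal sketch of what the per-range binding via an
xarray could look like.  The restrictedmem_bind()/restrictedmem_invalidate_start()
helpers, the bindings field, and the struct layout are assumptions for
illustration only, not code from the actual series; the point is that binding
rejects a second user per offset (the 1:1 association) and invalidation walks
only the notifiers bound to the affected range rather than every memslot.

    #include <linux/mutex.h>
    #include <linux/pagemap.h>
    #include <linux/xarray.h>

    struct restrictedmem_notifier;

    struct restrictedmem_notifier_ops {
            void (*invalidate_start)(struct restrictedmem_notifier *notifier,
                                     pgoff_t start, pgoff_t end);
    };

    struct restrictedmem_notifier {
            const struct restrictedmem_notifier_ops *ops;
    };

    struct restrictedmem_data {
            struct mutex lock;
            struct xarray bindings;         /* page offset -> bound notifier */
    };

    /* Bind @notifier to [@start, @end); fail if any offset is already bound. */
    static int restrictedmem_bind(struct restrictedmem_data *data,
                                  pgoff_t start, pgoff_t end,
                                  struct restrictedmem_notifier *notifier)
    {
            unsigned long index = start;
            int ret;

            mutex_lock(&data->lock);
            if (xa_find(&data->bindings, &index, end - 1, XA_PRESENT)) {
                    /* Enforce the 1:1 association: one user per offset. */
                    ret = -EEXIST;
            } else {
                    ret = xa_err(xa_store_range(&data->bindings, start, end - 1,
                                                notifier, GFP_KERNEL));
            }
            mutex_unlock(&data->lock);
            return ret;
    }

    /* Notify only the notifiers bound within [@start, @end). */
    static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
                                               pgoff_t start, pgoff_t end)
    {
            struct restrictedmem_notifier *notifier;
            unsigned long index;

            xa_for_each_range(&data->bindings, index, notifier, start, end - 1)
                    notifier->ops->invalidate_start(notifier, start, end);
    }

A production version would also need unbinding and care for a binding that
spans multiple xarray entries, but the lookup cost then scales with the number
of bindings overlapping the converted range instead of the total memslot count.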