On Tue, Mar 05, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 02:49:07PM -0800, Sean Christopherson wrote:
>
> [...]
>
> > The presence of MTE stuff shouldn't affect the fundamental access information,
>
> "When FEAT_MTE is implemented, for a synchronous Data Abort on an
> instruction that directly accesses Allocation Tags, ISV is 0."
>
> If there is no instruction syndrome, there's insufficient fault context
> to determine if the guest was doing a read or a write.
>
> > e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
> > irrespective of whether or not MTE is in play.
>
> When the MMU generates such an abort, it *is not* read, write, or execute.
> It is a NoTagAccess fault. There is no sane way to describe this in
> terms of RWX.

RWX=0, with KVM_MEMORY_EXIT_FLAG_MTE seems like a reasonable way to communicate
that, no?

> > > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > > > "caching" emulated MMIO accesses, and the resulting fault captures the
> > > > "reserved bits set" information in register state. But that's purely an
> > > > (optional) implementation detail of KVM that should never be exposed to
> > > > userspace.
> > >
> > > MMIO accesses would show up elsewhere though, right?
> >
> > Yes, but I don't see how that's relevant. Maybe I'm just misunderstanding what
> > you're saying/asking.
>
> If "reserved" EPT violations found their way to userspace via the
> "memory fault" exit structure then that'd likely be due to a KVM bug.

Heh, sadly no. Userspace can convert a GFN to private at any time, and the TDX
and SNP specs allow for implicit conversion "requests" from the guest, i.e. KVM
isn't allowed to kill the guest if the guest accesses a GFN with the "wrong"
private/shared attribute. This particular case would likely be hit only if
there's a userspace and/or guest bug, but whether the setup is broken or just
unique isn't KVM's call to make.
> The only expected flows in the near term are this and CoCo crap.
>
> > > Either way, I have no issues whatsoever if the direction for x86 is to
> > > provide abstracted fault information.
> >
> > I don't understand how ARM can get away with NOT providing a layer of abstraction.
> > Copying fault state verbatim to userspace will bleed KVM implementation details
> > into userspace,
>
> The memslot flag already bleeds KVM implementation detail into userspace
> to a degree. The event we're trying to let userspace handle is at the
> intersection of a specific hardware/software state.

Yes, but IMO there's a huge difference between userspace knowing that KVM uses
gup() and memslots to translate gfn=>hva=>pfn, or even knowing that KVM plays
games with reserved stage-2 PTE bits, and userspace knowing exactly how KVM
configures stage-2 PTEs.

Another example would be dirty logging on Intel CPUs. The *admin* can decide
whether to use a write-protection scheme or page-modification logging, but KVM's
ABI with userspace provides a layer of abstraction (dirty ring or bitmap) so that
the userspace VMM doesn't need to do X for write-protection and Y for PML.

> > Abstracting gory hardware details from userspace is one of the main roles of the
> > kernel.
>
> Where it can be accomplished without a loss (or misrepresentation) of
> information, agreed. But KVM UAPI is so architecture-specific that it
> seems arbitrary to draw the line here.

I don't think it's arbitrary. KVM's existing uAPI for mapping memory into the
guest is almost entirely arch-neutral, and I want to preserve that for related
functionality unless it's literally impossible to do so.

> > A concrete example of hardware throwing a wrench in things is AMD's upcoming
> > "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
> > CPUs for *any* VM that supports guest-controlled encrypted memory.
> > If KVM reported the page fault error code directly to userspace, then running
> > the same VM on different hardware generations, e.g. after live migration,
> > would generate different error codes.
> >
> > Are we talking past each other? I'm genuinely confused by the pushback on
> > capturing RWX information. Yes, the RWX info may be insufficient in some cases,
> > but its existence doesn't preclude KVM from providing more information as needed.
>
> My pushback isn't exactly on RWX (even though I noted the MTE quirk
> above). What I'm poking at here is the general infrastructure for
> reflecting faults into userspace, which is aggressively becoming more
> relevant.

But the purpose of memory_fault isn't to reflect faults into userspace, it's to
alert userspace that KVM has encountered a memory fault that requires userspace
action to resolve. That distinction matters because there are and will be MMU
features that KVM supports, and that can generate novel faults, but such faults
won't be punted to userspace unless KVM provides a way for userspace to explicitly
control the MMU feature. And if KVM lets userspace control a feature, then KVM
needs new uAPI to expose the controls. Which means that we should always have an
opportunity to expand memory_fault, e.g. with new flags, to support such features.
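
To make the "expand with new flags" point concrete, here is a minimal sketch of
how a VMM might decode such an exit. It is only an illustration under stated
assumptions: the struct is a local stand-in with the same three fields as
memory_fault (flags, gpa, size), and every FAULT_FLAG_* name and bit position,
along with classify_fault() and enum fault_kind, is hypothetical and not
settled uAPI.

```c
#include <stdint.h>

/*
 * Local stand-in for the memory_fault exit payload, redefined here so the
 * sketch builds without kernel headers.
 */
struct memory_fault {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

/* HYPOTHETICAL names and bit positions, for illustration only. */
#define FAULT_FLAG_READ		(1ULL << 0)
#define FAULT_FLAG_WRITE	(1ULL << 1)
#define FAULT_FLAG_EXEC		(1ULL << 2)
#define FAULT_FLAG_PRIVATE	(1ULL << 3)
#define FAULT_FLAG_MTE		(1ULL << 4)

#define FAULT_RWX_MASK	 (FAULT_FLAG_READ | FAULT_FLAG_WRITE | FAULT_FLAG_EXEC)
#define FAULT_KNOWN_MASK (FAULT_RWX_MASK | FAULT_FLAG_PRIVATE | FAULT_FLAG_MTE)

enum fault_kind {
	FAULT_KIND_UNKNOWN,	/* flags this VMM doesn't understand */
	FAULT_KIND_TAG,		/* tag access: MTE set, RWX all clear */
	FAULT_KIND_ACCESS,	/* ordinary access with RWX information */
	FAULT_KIND_NO_RWX,	/* known flags only, but no RWX information */
};

enum fault_kind classify_fault(const struct memory_fault *mf)
{
	uint64_t rwx = mf->flags & FAULT_RWX_MASK;

	/* A flag added after this VMM was built: don't guess, bail. */
	if (mf->flags & ~FAULT_KNOWN_MASK)
		return FAULT_KIND_UNKNOWN;

	/* Tag accesses carry no RWX info; MTE plus RWX is inconsistent. */
	if (mf->flags & FAULT_FLAG_MTE)
		return rwx ? FAULT_KIND_UNKNOWN : FAULT_KIND_TAG;

	return rwx ? FAULT_KIND_ACCESS : FAULT_KIND_NO_RWX;
}
```

In this sketch a VMM that predates a flag sees it as an unknown bit and can
fail cleanly, while one that knows the flag handles the new case without the
meaning of the RWX bits changing underneath it.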