On Wed, Jan 15, 2025 at 10:32:13AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 14, 2025 at 11:13:48PM +0000, Ankit Agrawal wrote:
> > > Do we really want another weirdly defined VMA flag? I'd really like to
> > > avoid this..
> >
> > I'd let Catalin chime in on this. My take of the reason for his suggestion is
> > that we want to reduce the affected configs to only the NVIDIA Grace based
> > systems. The nvgrace-gpu module would be setting the flag and the
> > new codepath will only be applicable there. Or am I missing something here?
>
> We cannot add VMA flags that are not clearly defined. The rules for
> when the VMA creator should set the flag need to be extremely clear
> and well defined.
>
> > > Can't we do a "this is a weird VM_PFNMAP thing, let's consult the VMA
> > > prot + whatever PFN information to find out if it is weird-device and
> > > how we could safely map it?"
> >
> > My understanding was that the new suggested flag VM_FORCE_CACHED
> > was essentially to represent "whatever PFN information to find out if it is
> > weird-device". Is there an alternate reliable check to figure this out?
>
> For instance FORCE_CACHED makes no sense, how will the VMA creator
> know it should set this flag?
>
> > Currently in the patch we check the following. So Jason, is the suggestion that
> > we simply return an error to forbid such a condition if VM_PFNMAP is set?
> > + else if (!mte_allowed && kvm_has_mte(kvm))
>
> I really don't know enough about MTE, but I would take the position
> that VM_PFNMAP does not support MTE, and maybe even that any VMA
> without VM_MTE/_ALLOWED does not support MTE?
>
> At least it makes a lot more sense for the VMA creator to indicate
> positively that the underlying HW supports MTE.

Sorry, I didn't get the chance to properly read this thread. I'll try
tomorrow or next week.

Basically I don't care whether MTE is supported on such a vma; I doubt
you'd want to enable MTE anyway. But the way MTE was designed in the
Arm architecture, prior to FEAT_MTE_PERM, allows a guest to enable MTE
at Stage 1 when Stage 2 is Normal WB Cacheable. We have no idea what
enabling MTE at Stage 1 means if the memory range doesn't support it.
It could be external aborts, SError or simply accessing data (as tags)
at random physical addresses that don't belong to the guest. So if a
vma does not have VM_MTE_ALLOWED, we either disable Stage 2 cacheable
or allow it with FEAT_MTE_PERM (patches from Aneesh on the list). Or,
a bigger hammer, disable MTE in guests (well, not that big a hammer,
not many platforms support MTE, especially in the enterprise space).

A second problem, similar to the relaxation to Normal NC we merged
last year, is that we can't tell what allowing Stage 2 cacheable means
(SError etc.). That's why I thought this knowledge lies with the
device; KVM doesn't have the information. Checking vm_page_prot
instead of a VM_* flag may work if it's mapped in user space, but this
might not always be the case. I don't see how VM_PFNMAP alone can tell
us anything about the access properties supported by a device address
range. Either way, it's the driver setting vm_page_prot or some VM_*
flag. KVM has no clue, it's just a memory slot.

A third aspect, more of a simplification when reasoning about this,
was to use FWB at Stage 2 to force cacheability and not care about
cache maintenance, especially when such a range might be mapped both
in user space and in the guest.
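To make the first point concrete, a rough sketch of the "disable
Stage 2 cacheable" option, somewhere in the user_mem_abort() path.
Purely illustrative, not an actual patch: stage2_cacheable_allowed()
is a made-up helper name, only kvm_has_mte() and VM_MTE_ALLOWED are
the existing bits of kvm/vma state.

	/*
	 * Illustrative sketch only: refuse a cacheable Stage 2 mapping
	 * for an MTE-enabled guest unless the vma positively advertises
	 * tag storage, rather than inferring anything from VM_PFNMAP
	 * alone.
	 */
	static bool stage2_cacheable_allowed(struct kvm *kvm,
					     struct vm_area_struct *vma)
	{
		/* The vma creator told us the backing memory has tags */
		if (vma->vm_flags & VM_MTE_ALLOWED)
			return true;

		/* Guest cannot enable MTE at Stage 1, WB Cacheable is fine */
		if (!kvm_has_mte(kvm))
			return true;

		/*
		 * MTE guest, no tag storage advertised: fall back to
		 * Normal NC or reject the mapping.
		 */
		return false;
	}

With the FEAT_MTE_PERM series the last case could instead become a
NoTagAccess mapping at Stage 2, and with FWB forcing cacheability we
wouldn't have to worry about cache maintenance for the user space
alias either.

--
Catalin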