Hi Marc, On Wed, Dec 11, 2019 at 04:56:48PM +0000, Marc Zyngier wrote: > A device mapping is normally always mapped at Stage-2, since there > is very little gain in having it faulted in. It is actually becoming less clear to me what the real benefits of pre-populating the stage 2 page table are, especially given that we can provoke a situation where they're faulted in anyhow. Do you recall if we had any specific case that motivated us to pre-fault in the pages? > > Nonetheless, it is possible to end-up in a situation where the device > mapping has been removed from Stage-2 (userspace munmaped the VFIO > region, and the MMU notifier did its job), but present in a userspace > mapping (userpace has mapped it back at the same address). In such > a situation, the device mapping will be demand-paged as the guest > performs memory accesses. > > This requires to be careful when dealing with mapping size, cache > management, and to handle potential execution of a device mapping. > > Cc: stable@xxxxxxxxxxxxxxx > Reported-by: Alexandru Elisei <alexandru.elisei@xxxxxxx> > Signed-off-by: Marc Zyngier <maz@xxxxxxxxxx> > --- > virt/kvm/arm/mmu.c | 21 +++++++++++++++++---- > 1 file changed, 17 insertions(+), 4 deletions(-) > > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c > index a48994af70b8..0b32a904a1bb 100644 > --- a/virt/kvm/arm/mmu.c > +++ b/virt/kvm/arm/mmu.c > @@ -38,6 +38,11 @@ static unsigned long io_map_base; > #define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0) > #define KVM_S2_FLAG_LOGGING_ACTIVE (1UL << 1) > > +static bool is_iomap(unsigned long flags) > +{ > + return flags & KVM_S2PTE_FLAG_IS_IOMAP; > +} > + nit: I'm not really sure this indirection makes the code more readable, but I guess that's a matter of taste. > static bool memslot_is_logging(struct kvm_memory_slot *memslot) > { > return memslot->dirty_bitmap && !(memslot->flags & KVM_MEM_READONLY); > @@ -1698,6 +1703,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > > vma_pagesize = vma_kernel_pagesize(vma); > if (logging_active || > + (vma->vm_flags & VM_PFNMAP) || WHat is actually the rationale for this? Why is a huge mapping not permitted to device memory? Are we guaranteed that VM_PFNMAP on the vma results in device mappings? I'm not convinced this is the case, and it would be better if we can stick to a single primitive (either kvm_is_device_pfn, or VM_PFNMAP) to detect device mappings. As a subsequent patch, I'd like to make sure that at the very least our memslot prepare function follows the exact same logic for mapping device memory as a fault-in approach does, or that we simply always fault pages in. > !fault_supports_stage2_huge_mapping(memslot, hva, vma_pagesize)) { > force_pte = true; > vma_pagesize = PAGE_SIZE; > @@ -1760,6 +1766,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > writable = false; > } > > + if (exec_fault && is_iomap(flags)) > + return -ENOEXEC; > + nit: why don't you just do this when checking kvm_is_device_pfn() and avoid having logic in two places to deal with this case? > spin_lock(&kvm->mmu_lock); > if (mmu_notifier_retry(kvm, mmu_seq)) > goto out_unlock; > @@ -1781,7 +1790,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > if (writable) > kvm_set_pfn_dirty(pfn); > > - if (fault_status != FSC_PERM) > + if (fault_status != FSC_PERM && !is_iomap(flags)) > clean_dcache_guest_page(pfn, vma_pagesize); > > if (exec_fault) > @@ -1948,9 +1957,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run) > if (kvm_is_error_hva(hva) || (write_fault && !writable)) { > if (is_iabt) { > /* Prefetch Abort on I/O address */ > - kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu)); > - ret = 1; > - goto out_unlock; > + ret = -ENOEXEC; > + goto out; > } > > /* > @@ -1992,6 +2000,11 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run) > ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status); > if (ret == 0) > ret = 1; > +out: > + if (ret == -ENOEXEC) { > + kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu)); > + ret = 1; > + } > out_unlock: > srcu_read_unlock(&vcpu->kvm->srcu, idx); > return ret; > -- > 2.20.1 > I can't seem to decide for myself if I think there's a sematic difference between trying to execute from somewhere the VMM has explicitly told us is device memory and from somewhere which we happen to have mapped with VM_PFNMAP from user space. But I also can't seem to really fault it (pun intended). Thoughts? Thanks, Christoffer