On Sat, 7 Jul 2018 22:44:45 +1000
Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:

> On Sat, 7 Jul 2018 21:43:03 +1000
> Nicholas Piggin <npiggin@xxxxxxxxx> wrote:
> 
> > On Sat, 7 Jul 2018 20:44:10 +1000
> > Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
> > 
> > > A VM which has:
> > > - a DMA capable device passed through to it (eg. network card);
> > > - a malicious kernel that ignores H_PUT_TCE failure;
> > > - the capability of using IOMMU pages bigger than physical pages
> > > can create an IOMMU mapping that exposes (for example) 16MB of
> > > the host physical memory to the device when only 64K was allocated to the VM.
> > >
> > > The remaining 16MB - 64K will be some other content of host memory, possibly
> > > including pages of the VM, but also pages of host kernel memory, host
> > > programs or other VMs.
> > >
> > > The attacking VM does not control the location of the page it can map,
> > > and is only allowed to map as many pages as it has pages of RAM.
> > >
> > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > get access to unassigned host memory; however this check is missing in
> > > the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so far
> > > and have not hit this yet because the very first time the mapping happens
> > > we do not have tbl::it_userspace allocated yet and fall back to
> > > userspace, which in turn calls the VFIO IOMMU driver; this fails and
> > > the guest does not retry.
> > >
> > > This stores the smallest preregistered page size in the preregistered
> > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > the IOMMU page size. The maximum page size is calculated as the minimum of
> > > the natural region alignment and the hugetlbfs compound page size.
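
(For illustration only — the corresponding hunk is not quoted in this part of
the thread — the check described above amounts to mm_iommu_ua_to_hpa()
refusing the translation when the requested IOMMU page shift is larger than
the smallest page shift recorded for the preregistered region. A simplified
sketch, not the exact code from the patch:

long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
{
	const long entry = (ua - mem->ua) >> PAGE_SHIFT;

	if (entry >= mem->entries)
		return -EFAULT;

	/*
	 * The IOMMU page must be contained in the smallest backing page
	 * of the region: e.g. reject a 16MB IOMMU page (pageshift == 24)
	 * when the region is only known to be backed by 64K pages
	 * (mem->pageshift == 16).
	 */
	if (pageshift > mem->pageshift)
		return -EFAULT;

	*hpa = mem->hpas[entry] | (ua & ~PAGE_MASK);

	return 0;
}

The exact hpas[] handling in the patch may differ; the point is the pageshift
comparison against tbl->it_page_shift passed in by the callers shown below.)
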
> > > Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
> > > ---
> > > Changes:
> > > v5:
> > > * only consider compound pages from hugetlbfs
> > >
> > > v4:
> > > * reimplemented max pageshift calculation
> > >
> > > v3:
> > > * fixed upper limit for the page size
> > > * added checks that we don't register parts of a huge page
> > >
> > > v2:
> > > * explicitly check for compound pages before calling compound_order()
> > >
> > > ---
> > > The bug is: run QEMU _without_ hugepages (no -mem-path) and tell it to
> > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > >
> > > With the change, mapping will fail in KVM and the guest will print:
> > >
> > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > > mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
> > > ---
> > >  arch/powerpc/include/asm/mmu_context.h |  4 ++--
> > >  arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
> > >  arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
> > >  arch/powerpc/mm/mmu_context_iommu.c    | 33 +++++++++++++++++++++++++++------
> > >  drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
> > >  5 files changed, 35 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > > index 896efa5..79d570c 100644
> > > --- a/arch/powerpc/include/asm/mmu_context.h
> > > +++ b/arch/powerpc/include/asm/mmu_context.h
> > > @@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
> > >  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> > >  		unsigned long ua, unsigned long entries);
> > >  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> > > -		unsigned long ua, unsigned long *hpa);
> > > +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> > >  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> > > -		unsigned long ua, unsigned long *hpa);
> > > +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> > >  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
> > >  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
> > >  #endif
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> > > index d066e37..8c456fa 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > > @@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
> > >  		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> > >  		return H_TOO_HARD;
> > >  
> > > -	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> > > +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
> > >  		return H_HARDWARE;
> > >  
> > >  	if (mm_iommu_mapped_inc(mem))
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > index 925fc31..5b298f5 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > @@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
> > >  	if (!mem)
> > >  		return H_TOO_HARD;
> > >  
> > > -	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> > > +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
> > > +			&hpa)))
> > >  		return H_HARDWARE;
> > >  
> > >  	pua = (void *) vmalloc_to_phys(pua);
> > > @@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> > >  
> > >  		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
> > >  		if (mem)
> > > -			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
> > > +			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
> > > +					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
> > >  	}
> > >  
> > >  	if (!prereg) {
> > > diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> > > index abb4364..0aebed6 100644
> > > --- a/arch/powerpc/mm/mmu_context_iommu.c
> > > +++ b/arch/powerpc/mm/mmu_context_iommu.c
> > > @@ -18,6 +18,7 @@
> > >  #include <linux/migrate.h>
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/swap.h>
> > > +#include <linux/hugetlb.h>
> > >  #include <asm/mmu_context.h>
> > >  
> > >  static DEFINE_MUTEX(mem_list_mutex);
> > > @@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
> > >  	struct rcu_head rcu;
> > >  	unsigned long used;
> > >  	atomic64_t mapped;
> > > +	unsigned int pageshift;
> > >  	u64 ua;			/* userspace address */
> > >  	u64 entries;		/* number of entries in hpas[] */
> > >  	u64 *hpas;		/* vmalloc'ed */
> > > @@ -125,6 +127,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  {
> > >  	struct mm_iommu_table_group_mem_t *mem;
> > >  	long i, j, ret = 0, locked_entries = 0;
> > > +	unsigned int pageshift;
> > >  	struct page *page = NULL;
> > >  
> > >  	mutex_lock(&mem_list_mutex);
> > > @@ -159,6 +162,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  		goto unlock_exit;
> > >  	}
> > >  
> > > +	/*
> > > +	 * Use @ua and @entries alignment as a starting point for a maximum
> > > +	 * page size calculation below.
> > > +	 */
> > > +	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
> > >  	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
> > >  	if (!mem->hpas) {
> > >  		kfree(mem);
> > 
> > I think it would be better to start with UINT_MAX or something to make
> > it clear that it's not a valid "guess" but rather just a starting point
> > for the min().
> 
> This is to handle unaligned @ua/@entries backed by a huge page. For
> example, with a 1GB page where we preregister only the second half of it,
> the maximum pageshift should be 29 instead of the 30 we would get if we
> started from 64.

Ah okay, that's what I'm missing. Maybe a small addition to the comment?
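
(To make the alignment argument concrete, a small standalone userspace
illustration of the starting-point calculation. The address is made up and a
64K PAGE_SHIFT is assumed:

#include <stdio.h>

#define PAGE_SHIFT	16	/* 64K system pages */

/* Same semantics as the kernel's __ffs(): index of the lowest set bit. */
static unsigned int lowest_set_bit(unsigned long x)
{
	return __builtin_ctzl(x);
}

int main(void)
{
	/* Second half of a 1GB huge page: ua is 512MB-aligned, not 1GB-aligned. */
	unsigned long ua = 0x60000000;				/* hypothetical address */
	unsigned long entries = (512UL << 20) >> PAGE_SHIFT;	/* 512MB of pages */

	/*
	 * Prints 29: even though every page is backed by a 1GB huge page,
	 * the region itself can never safely back a 1GB (shift 30) IOMMU page.
	 * Starting from 64 or UINT_MAX and min()-ing only with the backing
	 * page size would have left this at 30.
	 */
	printf("starting pageshift = %u\n",
			lowest_set_bit(ua | (entries << PAGE_SHIFT)));

	return 0;
}

Starting from the region alignment rather than from 64 or UINT_MAX is what
keeps the half-of-a-huge-page case above at 29.)
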
> > > @@ -167,8 +175,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  	}
> > >  
> > >  	for (i = 0; i < entries; ++i) {
> > > -		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > > -					1/* pages */, 1/* iswrite */, &page)) {
> > > +		struct vm_area_struct *vma = NULL;
> > > +
> > > +		if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
> > > +					1/* pages */, 1/* iswrite */, &page,
> > > +					&vma)) {
> > >  			ret = -EFAULT;
> > >  			for (j = 0; j < i; ++j)
> > >  				put_page(pfn_to_page(mem->hpas[j] >>
> > 
> > You need mmap_sem for read here, and held while you inspect the vma.
> > I would say don't hold it over the error handling and the
> > mm_iommu_move_page_from_cma() call though.
> 
> Ah, this is why it deadlocked when I tried this.
> 
> But do I really need mmap_sem here? get_user_pages_longterm() inspects
> vma with no mmap_sem held in the case of VFIO IOMMU Type1:
> 
> vfio_fops_unl_ioctl
>   vfio_iommu_type1_ioctl
>     vfio_dma_do_map
>       vfio_pin_map_dma
>         vfio_pin_pages_remote
>           vaddr_get_pfn
>             get_user_pages_longterm
> 
> Is that a bug? Other cases seem to hold it though.

Yes, you need mmap_sem. I think that would be a bug too.
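
(For reference, a rough sketch of the pattern discussed above, against the
2018-era mmap_sem API: hold mmap_sem for read across get_user_pages() and the
vma inspection, and drop it before the error unwinding and the CMA migration.
This is an illustration, not the actual follow-up patch; the names are those
of mm_iommu_get() in the quoted diff:

	for (i = 0; i < entries; ++i) {
		struct vm_area_struct *vma = NULL;
		unsigned int pageshift = PAGE_SHIFT;
		long npinned;

		down_read(&mm->mmap_sem);
		npinned = get_user_pages(ua + (i << PAGE_SHIFT),
				1/* pages */, 1/* iswrite */, &page, &vma);
		/* the vma pointer is only stable while mmap_sem is held */
		if (npinned == 1 && vma && is_vm_hugetlb_page(vma))
			pageshift = huge_page_shift(hstate_vma(vma));
		up_read(&mm->mmap_sem);

		if (npinned != 1) {
			/*
			 * put_page() unwinding of the pages pinned so far
			 * goes here, outside the lock (omitted).
			 */
			ret = -EFAULT;
			goto unlock_exit;
		}

		mem->pageshift = min(mem->pageshift, pageshift);

		/*
		 * mm_iommu_move_page_from_cma() and the hpas[] update
		 * continue as in the quoted patch, without mmap_sem held.
		 */
	}
)
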
Thanks,
Nick