On Wed, 11 Jul 2018 21:00:44 +1000
Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:

> A VM which has:
>  - a DMA capable device passed through to it (e.g. a network card);
>  - a malicious kernel that ignores H_PUT_TCE failure;
>  - the capability of using IOMMU pages bigger than physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
>
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
>
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
>
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however, this check is missing in
> the KVM fastpath (the H_PUT_TCE accelerated code). We have been lucky so far
> and have not hit this yet: the very first time the mapping happens we do not
> have tbl::it_userspace allocated yet, so we fall back to userspace, which in
> turn calls the VFIO IOMMU driver; this fails and the guest does not retry.
>
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size.
>
> This calculates the maximum page size as the minimum of the natural region
> alignment and the compound page size. For the page shift this uses the shift
> returned by find_linux_pte(), which indicates how the page is mapped to
> the current userspace - if the page is huge and the returned shift is not
> zero, then it is a leaf pte and the page is mapped within the range.
>
> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>

> @@ -199,6 +209,25 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		}
>  	}
>  populate:
> +		pageshift = PAGE_SHIFT;
> +		if (PageCompound(page)) {
> +			pte_t *pte;
> +			struct page *head = compound_head(page);
> +			unsigned int compshift = compound_order(head);
> +
> +			local_irq_save(flags); /* disables as well */
> +			pte = find_linux_pte(mm->pgd, ua, NULL, &pageshift);
> +			local_irq_restore(flags);
> +			if (!pte) {
> +				ret = -EFAULT;
> +				goto unlock_exit;
> +			}
> +			/* Double check it is still the same pinned page */
> +			if (pte_page(*pte) == head && pageshift == compshift)
> +				pageshift = max_t(unsigned int, pageshift,
> +						PAGE_SHIFT);

I don't understand this logic. If the page was different, the shift
would be wrong. You're not retrying but instead ignoring the mismatch
in that case.

I think I would be slightly happier with the definitely-not-racy
get_user_pages slow approach. Anything lock-less like this would be
a premature optimisation without performance numbers...

Thanks,
Nick
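
As an illustration of the shape of the check Alexey describes (storing the
smallest host page shift in the preregistered region descriptor and testing
it in the mm_iommu_xxx translation path), here is a minimal sketch; the
struct, field, and function names are made up for the example and are not
taken from the patch:

/*
 * Sketch only: remember the smallest host page shift seen while pinning
 * the region, then refuse to translate for an IOMMU page bigger than
 * that.  All names here are illustrative.
 */
struct pinned_region {
	unsigned long hpa_base;		/* host physical base of the region */
	unsigned int pageshift;		/* smallest host page shift seen */
};

static long region_ua_to_hpa(struct pinned_region *region,
		unsigned long offset, unsigned int iommu_pageshift,
		unsigned long *hpa)
{
	/* The IOMMU page must be contained in the backing host page */
	if (iommu_pageshift > region->pageshift)
		return -EFAULT;

	*hpa = region->hpa_base + offset;
	return 0;
}

The point of recording the *smallest* shift is that a single preregistered
region can be backed by a mix of page sizes, and the check has to hold for
the worst case.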
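
On the "get_user_pages slow approach": one possible reading (an assumption
here, not necessarily what Nick has in mind) is to derive the host page
shift from the struct page that get_user_pages() has already pinned,
instead of re-walking the page tables locklessly with find_linux_pte().
A sketch of that idea:

/*
 * Sketch under the assumption above: for a page pinned by
 * get_user_pages(), the compound order of its head page gives the size
 * of the backing host page without another page-table walk.
 */
static unsigned int pinned_page_shift(struct page *page)
{
	unsigned int pageshift = PAGE_SHIFT;

	if (PageCompound(page))
		pageshift = compound_order(compound_head(page)) + PAGE_SHIFT;

	return pageshift;
}

Whether this actually sidesteps the race depends on the pin keeping the
huge page from being split underneath the caller, so treat it as a starting
point for discussion rather than a drop-in replacement.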