On 1/8/2021 4:32 PM, Alex Williamson wrote:
> On Tue, 5 Jan 2021 07:36:53 -0800
> Steve Sistare <steven.sistare@xxxxxxxxxx> wrote:
>
>> Block translation of host virtual address while an iova range is suspended.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 48 ++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2c164a6..8035b9a 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -31,6 +31,8 @@
>>  #include <linux/rbtree.h>
>>  #include <linux/sched/signal.h>
>>  #include <linux/sched/mm.h>
>> +#include <linux/delay.h>
>> +#include <linux/kthread.h>
>>  #include <linux/slab.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/vfio.h>
>> @@ -484,6 +486,34 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>          return ret;
>>  }
>>
>> +static bool vfio_iommu_contained(struct vfio_iommu *iommu)
>> +{
>> +        struct vfio_domain *domain = iommu->external_domain;
>> +        struct vfio_group *group;
>> +
>> +        if (!domain)
>> +                domain = list_first_entry(&iommu->domain_list,
>> +                                          struct vfio_domain, next);
>> +
>> +        group = list_first_entry(&domain->group_list, struct vfio_group, next);
>> +        return vfio_iommu_group_contained(group->iommu_group);
>> +}
>
> This seems really backwards for a vfio iommu backend to ask vfio-core
> whether its internal object is active.

Yes. I considered the architecturally purer approach of defining a new
back-end interface and calling it from the core when the change to
uncontained occurs, but it seemed like overkill.

>> +
>> +
>> +bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> +{
>> +        while (dma->suspended) {
>> +                mutex_unlock(&iommu->lock);
>> +                msleep_interruptible(10);
>> +                mutex_lock(&iommu->lock);
>> +                if (kthread_should_stop() || !vfio_iommu_contained(iommu) ||
>> +                    fatal_signal_pending(current)) {
>> +                        return false;
>> +                }
>> +        }
>> +        return true;
>> +}
>> +
>>  /*
>>   * Attempt to pin pages.  We really don't want to track all the pfns and
>>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>> @@ -690,6 +720,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>                          continue;
>>                  }
>>
>> +                if (!vfio_vaddr_valid(iommu, dma)) {
>> +                        ret = -EFAULT;
>> +                        goto pin_unwind;
>> +                }
>
> This could have dropped iommu->lock, we have no basis to believe our
> vfio_dma object is still valid.

Ouch, I am embarrassed. To fix, I will pass the iova to vfio_vaddr_valid()
and call vfio_find_dma() after re-acquiring the lock.

> Also this is called with an elevated
> reference to the container, so we theoretically should not need to test
> whether the iommu is still contained.

We still have a reference to the container, but userland may have closed
the container fd and hence will never call the ioctl to replace the vaddr,
so we must bail.
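
Concretely, something like this (untested sketch, not the final patch;
the helper name and the NULL-return convention are just for illustration).
The helper takes the iova instead of a vfio_dma pointer, and re-looks-up
the vfio_dma with vfio_find_dma() every time it re-acquires the lock:

static struct vfio_dma *vfio_wait_vaddr_valid(struct vfio_iommu *iommu,
                                              dma_addr_t iova)
{
        struct vfio_dma *dma = vfio_find_dma(iommu, iova, 1);

        while (dma && dma->suspended) {
                /*
                 * Sleep without the lock so the ioctl that replaces the
                 * vaddr can run and clear 'suspended'.
                 */
                mutex_unlock(&iommu->lock);
                msleep_interruptible(10);
                mutex_lock(&iommu->lock);

                if (kthread_should_stop() || !vfio_iommu_contained(iommu) ||
                    fatal_signal_pending(current))
                        return NULL;

                /*
                 * The old vfio_dma may have been freed while the lock was
                 * dropped; look it up again instead of touching it.
                 */
                dma = vfio_find_dma(iommu, iova, 1);
        }
        return dma;     /* NULL if the mapping went away or we must bail */
}

Callers such as vfio_iommu_type1_pin_pages() would then use the returned
vfio_dma rather than the pointer they held before the lock was dropped.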
>> +
>>                  remote_vaddr = dma->vaddr + (iova - dma->iova);
>>                  ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
>>                                               do_accounting);
>> @@ -1514,12 +1549,16 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>                                          i += PAGE_SIZE;
>>                                  }
>>                          } else {
>> -                                unsigned long pfn;
>> -                                unsigned long vaddr = dma->vaddr +
>> -                                                (iova - dma->iova);
>> +                                unsigned long pfn, vaddr;
>>                                  size_t n = dma->iova + dma->size - iova;
>>                                  long npage;
>>
>> +                                if (!vfio_vaddr_valid(iommu, dma)) {
>> +                                        ret = -EFAULT;
>> +                                        goto unwind;
>> +                                }
>
> This one is even scarier, do we have any basis to assume either object
> is still valid after iommu->lock is released?

Another ouch. The iommu object still exists, but its dma_list may have
changed. We cannot even unwind and punt with EAGAIN.

I can fix this by adding a preamble loop that holds iommu->lock and
verifies that no dma_list entry is suspended. If the loop hits a
suspended entry, it drops the lock and sleeps (like vfio_vaddr_valid),
then retries from the beginning of the list. On reaching the end of the
list, the lock remains held and prevents entries from being suspended.
I sketch this preamble at the end of this mail.

Instead of retrying, I could return EAGAIN, but that would require
unwinding of earlier groups in __vfio_container_attach_groups.

> We can only get here if
> we're attaching a group to the iommu with suspended vfio_dmas.  Do we
> need to allow that?  It seems we could -EAGAIN and let the user figure
> out that they can't attach a group while mappings have invalid vaddrs.

I would like to allow it to minimize userland code changes.

>> +                                vaddr = dma->vaddr + (iova - dma->iova);
>> +
>>                                  npage = vfio_pin_pages_remote(dma, vaddr,
>>                                                                n >> PAGE_SHIFT,
>>                                                                &pfn, limit);
>> @@ -2965,6 +3004,9 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
>>          if (count > dma->size - offset)
>>                  count = dma->size - offset;
>>
>> +        if (!vfio_vaddr_valid(iommu, dma))
>
> Same as the pinning path, this should hold a container user reference
> but we cannot assume our vfio_dma object is valid if we've released the
> lock.  Thanks,

Yup. Same fix needed.

- Steve

>> +                goto out;
>> +
>>          vaddr = dma->vaddr + offset;
>>
>>          if (write) {
>
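
P.S. The preamble loop I have in mind for vfio_iommu_replay() looks
roughly like this (again an untested sketch; vfio_wait_all_valid is a
placeholder name). It is called with iommu->lock held and returns with
it still held, so the replay loop that follows never drops the lock:

static int vfio_wait_all_valid(struct vfio_iommu *iommu)
{
        struct rb_node *n;
        struct vfio_dma *dma;

again:
        for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
                dma = rb_entry(n, struct vfio_dma, node);
                if (!dma->suspended)
                        continue;

                /* Sleep without the lock, as vfio_vaddr_valid does. */
                mutex_unlock(&iommu->lock);
                msleep_interruptible(10);
                mutex_lock(&iommu->lock);

                if (kthread_should_stop() || !vfio_iommu_contained(iommu) ||
                    fatal_signal_pending(current))
                        return -EFAULT;

                /* dma_list may have changed while unlocked; restart. */
                goto again;
        }

        /* Lock still held; no entry is, or can become, suspended. */
        return 0;
}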