Re-architect the interfaces that allow the underlying memory object of an
iova range to be mapped in a new address space.  The old interfaces allow
userland to indefinitely block vfio mediated device kernel threads, and do
not propagate the locked_vm count to a new mm.

Interface changes:
  - disable the VFIO_UPDATE_VADDR extension
  - delete VFIO_DMA_UNMAP_FLAG_VADDR
  - redefine VFIO_DMA_MAP_FLAG_VADDR

New interfaces:
  - VFIO_CHANGE_DMA_OWNER iommu driver ioctl
  - VFIO_DMA_OWNER extension, consisting of VFIO_CHANGE_DMA_OWNER and
    the redefined VFIO_DMA_MAP_FLAG_VADDR.

VFIO_DMA_MAP_FLAG_VADDR changes the base virtual address for a dma mapping.
It is called after exec, after the application remaps the corresponding
shared memory object that was preserved across exec.  However, the change
does not take effect until VFIO_CHANGE_DMA_OWNER is called.  This allows the
application to iterate and register a new vaddr for all dma's, and have them
take effect atomically.

VFIO_CHANGE_DMA_OWNER changes the task and mm for all dma mappings to that
of the caller, and transfers the locked_vm count from the old to the new mm.
The vaddr for each mapping must either be the same in the old and new mm
(eg after fork), or must have been updated with VFIO_DMA_MAP_FLAG_VADDR
(eg after exec).  Subsequently, the caller is the only task that is allowed
to pin pages for dma.  This prevents an application from exceeding the
initial task's RLIMIT_MEMLOCK by fork'ing and pinning in children.

These interfaces can be used to implement live update, in which a process
such as qemu exec's an updated version of itself, while preserving its
guest and vfio devices.  The application must preserve the vfio descriptors
across fork and exec, and must not start each step below until the previous
step has finished.

parent                                  child

1. fork
2.                                      ioctl(VFIO_CHANGE_DMA_OWNER)
3. exec new binary
4. foreach dma mapping
     va = mmap()
     ioctl(, VFIO_DMA_MAP_FLAG_VADDR, va)
   ioctl(VFIO_CHANGE_DMA_OWNER)
5.                                      exit

With this arrangement, the dma mappings are always associated with a valid
mm, and mediated device requests such as vfio_pin_pages and vfio_dma_rw
block only briefly during the ioctls.  Thanks to Jason Gunthorpe for
suggesting fork and the change mm ioctl.

Lastly, if a task exits or execs, and it still owns any dma mappings, they
are unmapped and unpinned.  This guarantees that pages do not remain pinned
indefinitely if a vfio descriptor is leaked to another process, and requires
tasks to explicitly transfer ownership of dma (and hence locked_vm) to a new
task and mm when continued operation is desired.  The vfio driver maps a
special vma so it can detect exit and exec, via the vm_operations_struct
close callback.

Steve Sistare (8):
  vfio: delete interfaces to update vaddr
  vfio/type1: dma owner permission
  vfio: close dma owner
  vfio/type1: close dma owner
  vfio/type1: track locked_vm per dma
  vfio/type1: update vaddr
  vfio: change dma owner
  vfio/type1: change dma owner

 drivers/vfio/container.c        | 169 ++++++++++++++++++-
 drivers/vfio/vfio.h             |   9 +-
 drivers/vfio/vfio_iommu_type1.c | 362 +++++++++++++++++++++++-----------------
 include/uapi/linux/vfio.h       |  54 ++++--
 4 files changed, 412 insertions(+), 182 deletions(-)

-- 
1.8.3.1
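
For illustration only, the deferred vaddr update and the ownership transfer
described in this cover letter can be modeled in plain userspace C.  This is
a toy sketch, not the code in the patches: every name below (toy_mm, toy_dma,
toy_container, toy_update_vaddr, toy_change_owner, toy_live_update_demo) is
hypothetical, no vfio ioctl is issued, and the page counts and addresses are
made up.  It only shows that a staged vaddr stays inert until the owner
changes, and that the locked page count moves from the old mm to the new one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an mm: only the locked page count is modeled. */
struct toy_mm {
    long locked_vm;
};

/* Toy stand-in for one dma mapping (hypothetical layout). */
struct toy_dma {
    uint64_t iova;
    uint64_t vaddr;         /* vaddr currently in effect */
    uint64_t staged_vaddr;  /* vaddr registered but not yet in effect */
    bool     staged;
    long     locked_pages;  /* pages charged to the owner's locked_vm */
};

struct toy_container {
    struct toy_mm  *owner;
    struct toy_dma *dma;
    size_t          ndma;
};

/* Models VFIO_DMA_MAP_FLAG_VADDR: stage a new vaddr without applying it. */
static int toy_update_vaddr(struct toy_container *c, uint64_t iova,
                            uint64_t vaddr)
{
    for (size_t i = 0; i < c->ndma; i++) {
        if (c->dma[i].iova == iova) {
            c->dma[i].staged_vaddr = vaddr;
            c->dma[i].staged = true;
            return 0;
        }
    }
    return -1;  /* no mapping at this iova */
}

/* Models VFIO_CHANGE_DMA_OWNER: apply staged vaddrs and move locked_vm. */
static void toy_change_owner(struct toy_container *c, struct toy_mm *new_mm)
{
    for (size_t i = 0; i < c->ndma; i++) {
        struct toy_dma *d = &c->dma[i];

        if (d->staged) {
            d->vaddr = d->staged_vaddr;
            d->staged = false;
        }
        c->owner->locked_vm -= d->locked_pages;
        new_mm->locked_vm += d->locked_pages;
    }
    c->owner = new_mm;
}

/* Walks steps 1-5 of the sequence; returns 0 if the model behaves as described. */
int toy_live_update_demo(void)
{
    struct toy_mm parent = { .locked_vm = 48 }, child = { 0 }, parent_new = { 0 };
    struct toy_dma dma[] = {
        { .iova = 0x1000, .vaddr = 0x7f0000000000ULL, .locked_pages = 16 },
        { .iova = 0x9000, .vaddr = 0x7f0000100000ULL, .locked_pages = 32 },
    };
    struct toy_container c = { .owner = &parent, .dma = dma, .ndma = 2 };

    /* Steps 1-2: vaddrs are unchanged after fork; the child takes ownership. */
    toy_change_owner(&c, &child);
    assert(parent.locked_vm == 0 && child.locked_vm == 48);

    /* Steps 3-4: the parent execs (new mm) and stages one vaddr per mapping. */
    if (toy_update_vaddr(&c, 0x1000, 0x7e0000000000ULL) != 0)
        return -1;
    if (toy_update_vaddr(&c, 0x9000, 0x7e0000100000ULL) != 0)
        return -1;
    assert(dma[0].vaddr == 0x7f0000000000ULL);  /* staged, not yet in effect */

    /* End of step 4: all staged vaddrs take effect together. */
    toy_change_owner(&c, &parent_new);
    assert(dma[0].vaddr == 0x7e0000000000ULL);
    assert(dma[1].vaddr == 0x7e0000100000ULL);

    /* Step 5: the child owns nothing and may exit. */
    assert(child.locked_vm == 0 && parent_new.locked_vm == 48);
    return 0;
}
```

In this model the total locked page count is conserved across each transfer,
which mirrors the stated goal of preventing a forked child from pinning
beyond the initial task's RLIMIT_MEMLOCK.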