On Fri, 30 Jun 2017 00:50:57 +1000 Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: > At the moment VFIO does not allow mapping MSIX BAR to the userspace > and provides information about what can be mapped via a PCI region > capability. This ensures that the userspace does not get access > to MSIX vector table and cannot break MSIX setup. For most platforms > the userspace also chooses not to map MSIX as it has to emulate it for > fully virtualized guests. > > This creates a problem when an MSIX BAR is not aligned to a system > page size and in order to prevent mapping, chunks before and/or after > the MSIX BAR can't also be mapped to the userspace. If these chunks > happen to be used very often (for example, Emulex LightPulse FC), > we see half of expected performance. > > This enables mapping of MSIX BARs to the userspace by advertising > these BARs as a single chunk (as any other BAR). The userspace > still is expected not to mmap an MSIX BAR if it want to emulate it > (QEMU still has this code). > > This does not break protection as at the moment for the Type1 IOMMU, > vfio_iommu_type1_attach_group() hook fails if MSIX remapping > is not enabled and this ensures that even ill-behaved device cannot > harm the host. For sPAPR IOMMU, the hardware always provides such > isolation by default. The only case when such mapping protection > is rather needed is allow_unsafe_interrupts but as its name suggests, > it is not safe anyway. I don't think allow_unsafe_interrupts is sufficiently un-safe to entirely disregard it like this either. It's currently an opt-in, but we don't consider it unsafe enough that we taint the kernel as a result of using it. Allowing user write access to the MSI-X vector table when there's no IOMMU protection upgrades an MSI injection attack from somewhat involved to almost trivially easy. So nak to that. > Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> > > --- > > This is to replace 6 patches from Yongji > "[RESEND PATCH v2 0/6] vfio-pci: Add support for mmapping MSI-X table" > > Previously we tried to add a MSI_REMAP flag to a PCI bus and set it > via various means for various platforms. It relied on > MSI_FLAG_IRQ_REMAPPING from msi_domain_info which never made it to > the upstream. Instead Type1 IOMMU learned to check remapping > capabilities and refuse to work and it seems like a good idea to use > this. Type1 has always known how to do this, but interrupt remapping is a property on the IOMMU, not the bus, which is why Yongji was trying to create a PCI bus property to expose to the device. > How much do we care about allow_unsafe_interrupts? If we care enough, > I'll need a way to pass the msi_remap flag from > vfio_iommu_type1_attach_group() down to vfio_pci_bar_rw/vfio_pci_mmap, > it does not seem easy to make it nice. As above, I think you're abusing an opt-in security concern as an excuse to allow an almost trivially easy exploit. I also prefer Yongji's approach of exposing the capability through bus properties rather than creating a back channel between vfio modules as you seem to suggest. > What else did I miss? > > Please comment. Thanks. > --- > drivers/vfio/pci/vfio_pci.c | 41 ++++------------------------------------ > drivers/vfio/pci/vfio_pci_rdwr.c | 5 ----- > 2 files changed, 4 insertions(+), 42 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c > index 324c52e3a1a4..222bc07b9e41 100644 > --- a/drivers/vfio/pci/vfio_pci.c > +++ b/drivers/vfio/pci/vfio_pci.c > @@ -559,15 +559,8 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev, > struct vfio_info_cap *caps) > { > struct vfio_region_info_cap_sparse_mmap *sparse; > - size_t end, size; > - int nr_areas = 2, i = 0, ret; > - > - end = pci_resource_len(vdev->pdev, vdev->msix_bar); > - > - /* If MSI-X table is aligned to the start or end, only one area */ > - if (((vdev->msix_offset & PAGE_MASK) == 0) || > - (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end)) > - nr_areas = 1; > + size_t size; > + int nr_areas = 1, i = 0, ret; > > size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas)); > > @@ -577,18 +570,8 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev, > > sparse->nr_areas = nr_areas; > > - if (vdev->msix_offset & PAGE_MASK) { > - sparse->areas[i].offset = 0; > - sparse->areas[i].size = vdev->msix_offset & PAGE_MASK; > - i++; > - } > - > - if (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) < end) { > - sparse->areas[i].offset = PAGE_ALIGN(vdev->msix_offset + > - vdev->msix_size); > - sparse->areas[i].size = end - sparse->areas[i].offset; > - i++; > - } > + sparse->areas[i].offset = 0; > + sparse->areas[i].size = pci_resource_len(vdev->pdev, vdev->msix_bar); > > ret = vfio_info_add_capability(caps, VFIO_REGION_INFO_CAP_SPARSE_MMAP, > sparse); This doesn't make any sense to me, why even expose it as a sparse mmap capability if there's nothing sparse about it? > @@ -1115,22 +1098,6 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma) > if (req_start + req_len > phys_len) > return -EINVAL; > > - if (index == vdev->msix_bar) { > - /* > - * Disallow mmaps overlapping the MSI-X table; users don't > - * get to touch this directly. We could find somewhere > - * else to map the overlap, but page granularity is only > - * a recommendation, not a requirement, so the user needs > - * to know which bits are real. Requiring them to mmap > - * around the table makes that clear. > - */ > - > - /* If neither entirely above nor below, then it overlaps */ > - if (!(req_start >= vdev->msix_offset + vdev->msix_size || > - req_start + req_len <= vdev->msix_offset)) > - return -EINVAL; > - } > - > /* > * Even though we don't make use of the barmap for the mmap, > * we need to request the region and the barmap tracks that. > diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c > index 357243d76f10..916b72c5c4fc 100644 > --- a/drivers/vfio/pci/vfio_pci_rdwr.c > +++ b/drivers/vfio/pci/vfio_pci_rdwr.c > @@ -164,11 +164,6 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf, > } else > io = vdev->barmap[bar]; > > - if (bar == vdev->msix_bar) { > - x_start = vdev->msix_offset; > - x_end = vdev->msix_offset + vdev->msix_size; > - } > - > done = do_io_rw(io, buf, pos, count, x_start, x_end, iswrite); > > if (done >= 0) Nope, I prefer Yongji's approach, sorry. Thanks, Alex