On 29/11/17 16:17, David Gibson wrote: > On Wed, Nov 29, 2017 at 03:38:30PM +1100, Alexey Kardashevskiy wrote: >> On 29/11/17 14:57, David Gibson wrote: >>> On Wed, Nov 29, 2017 at 02:25:43PM +1100, Alexey Kardashevskiy wrote: >>>> On 22/11/17 17:51, David Gibson wrote: >>>>> On Tue, Nov 21, 2017 at 10:14:45PM -0700, Alex Williamson wrote: >>>>>> On Wed, 22 Nov 2017 15:44:55 +1100 >>>>>> David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote: >>>>>> >>>>>>> On Tue, Nov 21, 2017 at 09:28:46PM -0700, Alex Williamson wrote: >>>>>>>> On Wed, 22 Nov 2017 15:09:32 +1100 >>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: >>>>>>>> >>>>>>>>> By default VFIO disables mapping of MSIX BAR to the userspace as >>>>>>>>> the userspace may program it in a way allowing spurious interrupts; >>>>>>>>> instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl. >>>>>>>>> >>>>>>>>> This works fine as long as the system page size equals to the MSIX >>>>>>>>> alignment requirement which is 4KB. However with a bigger page size >>>>>>>>> the existing code prohibits mapping non-MSIX parts of a page with MSIX >>>>>>>>> structures so these parts have to be emulated via slow reads/writes on >>>>>>>>> a VFIO device fd. If these emulated bits are accessed often, this has >>>>>>>>> serious impact on performance. >>>>>>>>> >>>>>>>>> This adds an ioctl to the vfio-pci device which hides the sparse >>>>>>>>> capability and allows the userspace to map a BAR with MSIX structures. >>>>>>>> >>>>>>>> So the user is in control of telling the kernel whether they're allowed >>>>>>>> to mmap the msi-x vector table. That makes absolutely no sense. If >>>>>>>> you're trying to figure out how userspace knows whether to implicitly >>>>>>>> avoid mmap'ing the msix region, I think there are far better ways in >>>>>>>> the existing region info ioctl. We could use a flag, or maybe the >>>>>>>> existence of a capability chain pointer, or a new capability. But >>>>>>>> absolutely not this. The kernel needs to decide whether it's going to >>>>>>>> let the user do this, not the user. Thanks, >>>>>>> >>>>>>> No, it doesn't. This is actually the approach we discussed in Prague. >>>>>>> >>>>>>> Remember that intercepting access to the MSI-X table is not a host >>>>>>> safety / security issue. It's just that without that we won't wire up >>>>>>> the device's MSI-X vectors properly so they won't work. >>>>>>> >>>>>>> Basically the decision here is between >>>>>>> >>>>>>> A) Allow MSI-X configuration via standard PCI mechanisms, at the >>>>>>> cost of making access slow for any registers sharing a page with >>>>>>> the MSI-X table. >>>>>>> >>>>>>> or >>>>>>> >>>>>>> B) Make access to BAR registers sharing a page with the MSI-X table >>>>>>> fast, at the cost of requiring some alternative mechanism to >>>>>>> configure MSI-X vectors. >>>>>>> >>>>>>> And that is a tradeoff that it is reasonable for userspace to make. >>>>>>> >>>>>>> In the case of KVM guests, the decision depends entirely on the >>>>>>> *guest* platform. Usually we need (A) because the guest expects to be >>>>>>> able to poke the MSI-X table in the usual way. However for PAPR >>>>>>> guests, there's an alternative mechanism via an RTAS call, which means >>>>>>> we can use (B). >>>>>>> >>>>>>> The host kernel can't make this decision, because it doesn't know the >>>>>>> guest platform (well, KVM might, but VFIO doesn't). >>>>>>> >>>>>>> A userspace VFIO program could also elect for (B) if it does care >>>>>>> about performance of access to registers in the same BAR as the MSI-X >>>>>>> table, but doesn't need MSI-X for example. >>>>>> >>>>>> You're asking for an ioctl to allow the kernel to allow the user to >>>>>> mmap the page, when instead we could just allow the user to mmap the >>>>>> page and whether the user does that and how they make use of it is up >>>>>> to them... >>>>> >>>>> Duh. Sorry. For some reason I was thinking the magic MSI-X >>>>> interception was happening in the host kernel rather than in qemu. >>>>> >>>>>> I understand that there are different virtualization techniques at play >>>>>> here, it just doesn't seem relevant. In the case of (A), the user can >>>>>> choose not to mmap the page overlapping the vector table even if the >>>>>> kernel allows it. The user can also choose to mmap that page, but not >>>>>> use the portion overlapping the vector table. QEMU already does this >>>>>> by overlaying a MemoryRegion for vector table emulation. We might even >>>>>> be able to get away with mmaping that page and emulating the vector >>>>>> table elsewhere, which seems like the only option for a 64k page ARM >>>>>> system. For (B), clearly it's just a nuisance that we can't currently >>>>>> mmap this page, but I still don't see how the user allowing the kernel >>>>>> to allow the user to mmap that page makes any sense. I can't even >>>>>> describe it without it sounding ridiculous. Thanks, >>>>> >>>>> Right. Rethinking.. it seems to me we should just completely remove >>>>> the logic from the kernel banning mmap()s overlapping the MSI-X >>>>> table. All it does is poorly attempt to stop the user shooting >>>>> themselves in the foot. >>>>> >>>>> Then we just need logic in qemu to avoid doing the overlapping memory >>>>> region nonsense on a per-machine basis >>>> >>>> >>>> So is there still any plan or we just ditch the feature? I am confused now. >>> >>> The plan is what I said above. Remove the bogus check logic from the >>> kernel, then solve within qemu, by not creating the MSI-X intercept >>> region for pseries guests. >> >> >> There were 2 proposals how to do that. Both included platform code to >> decide whether to allow mapping or not and some transport to pass that >> enablement flag from the plafform code to the VFIO-PCI driver, one was via >> an IOMMU group flag, the other via a PCI bus flag. Neither was accepted so >> reposting those won't make any progress, what do I miss here? Was there any >> agreement on how to do this? > > As we've discussed the filtering of mmap()able regions in the kernel > is not a security concern. All it accomplishes is poorly trying to > stop userspace from shooting itself in the foot by directly mapping > and accessing the MSI-X table instead of using the VFIO interfaces to > set up MSI vectors. > > We should simply remove - unconditionally - the checks in the kernel > which prevent parts of the BARs from being mapped. > Just like this? And Alex is ok with that? diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 9ed1ecf..106b825 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1146,22 +1146,6 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma) if (req_start + req_len > phys_len) return -EINVAL; - if (!vdev->msix_mmap && index == vdev->msix_bar) { - /* - * Disallow mmaps overlapping the MSI-X table; users don't - * get to touch this directly. We could find somewhere - * else to map the overlap, but page granularity is only - * a recommendation, not a requirement, so the user needs - * to know which bits are real. Requiring them to mmap - * around the table makes that clear. - */ - - /* If neither entirely above nor below, then it overlaps */ - if (!(req_start >= vdev->msix_offset + vdev->msix_size || - req_start + req_len <= vdev->msix_offset)) - return -EINVAL; - } -- Alexey
Attachment:
signature.asc
Description: OpenPGP digital signature