Re: [RFC PATCH kernel] vfio-pci: Allow mapping MSIX BAR

Alexey Kardashevskiy <aik@xxxxxxxxx> · Wed, 29 Nov 2017 18:58:11 +1100

On 29/11/17 16:17, David Gibson wrote:
> On Wed, Nov 29, 2017 at 03:38:30PM +1100, Alexey Kardashevskiy wrote:
>> On 29/11/17 14:57, David Gibson wrote:
>>> On Wed, Nov 29, 2017 at 02:25:43PM +1100, Alexey Kardashevskiy wrote:
>>>> On 22/11/17 17:51, David Gibson wrote:
>>>>> On Tue, Nov 21, 2017 at 10:14:45PM -0700, Alex Williamson wrote:
>>>>>> On Wed, 22 Nov 2017 15:44:55 +1100
>>>>>> David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>> On Tue, Nov 21, 2017 at 09:28:46PM -0700, Alex Williamson wrote:
>>>>>>>> On Wed, 22 Nov 2017 15:09:32 +1100
>>>>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
>>>>>>>>   
>>>>>>>>> By default VFIO disables mapping of MSIX BAR to the userspace as
>>>>>>>>> the userspace may program it in a way allowing spurious interrupts;
>>>>>>>>> instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
>>>>>>>>>
>>>>>>>>> This works fine as long as the system page size equals to the MSIX
>>>>>>>>> alignment requirement which is 4KB. However with a bigger page size
>>>>>>>>> the existing code prohibits mapping non-MSIX parts of a page with MSIX
>>>>>>>>> structures so these parts have to be emulated via slow reads/writes on
>>>>>>>>> a VFIO device fd. If these emulated bits are accessed often, this has
>>>>>>>>> serious impact on performance.
>>>>>>>>>
>>>>>>>>> This adds an ioctl to the vfio-pci device which hides the sparse
>>>>>>>>> capability and allows the userspace to map a BAR with MSIX structures.  
>>>>>>>>
>>>>>>>> So the user is in control of telling the kernel whether they're allowed
>>>>>>>> to mmap the msi-x vector table.  That makes absolutely no sense.  If
>>>>>>>> you're trying to figure out how userspace knows whether to implicitly
>>>>>>>> avoid mmap'ing the msix region, I think there are far better ways in
>>>>>>>> the existing region info ioctl.  We could use a flag, or maybe the
>>>>>>>> existence of a capability chain pointer, or a new capability.  But
>>>>>>>> absolutely not this.  The kernel needs to decide whether it's going to
>>>>>>>> let the user do this, not the user.  Thanks,  
>>>>>>>
>>>>>>> No, it doesn't.  This is actually the approach we discussed in Prague.
>>>>>>>
>>>>>>> Remember that intercepting access to the MSI-X table is not a host
>>>>>>> safety / security issue.  It's just that without that we won't wire up
>>>>>>> the device's MSI-X vectors properly so they won't work.
>>>>>>>
>>>>>>> Basically the decision here is between
>>>>>>>
>>>>>>>    A) Allow MSI-X configuration via standard PCI mechanisms, at the
>>>>>>>       cost of making access slow for any registers sharing a page with
>>>>>>>       the MSI-X table.
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>>    B) Make access to BAR registers sharing a page with the MSI-X table
>>>>>>>       fast, at the cost of requiring some alternative mechanism to
>>>>>>>       configure MSI-X vectors.
>>>>>>>
>>>>>>> And that is a tradeoff that it is reasonable for userspace to make.
>>>>>>>
>>>>>>> In the case of KVM guests, the decision depends entirely on the
>>>>>>> *guest* platform.  Usually we need (A) because the guest expects to be
>>>>>>> able to poke the MSI-X table in the usual way.  However for PAPR
>>>>>>> guests, there's an alternative mechanism via an RTAS call, which means
>>>>>>> we can use (B).
>>>>>>>
>>>>>>> The host kernel can't make this decision, because it doesn't know the
>>>>>>> guest platform (well, KVM might, but VFIO doesn't).
>>>>>>>
>>>>>>> A userspace VFIO program could also elect for (B) if it does care
>>>>>>> about performance of access to registers in the same BAR as the MSI-X
>>>>>>> table, but doesn't need MSI-X for example.
>>>>>>
>>>>>> You're asking for an ioctl to allow the kernel to allow the user to
>>>>>> mmap the page, when instead we could just allow the user to mmap the
>>>>>> page and whether the user does that and how they make use of it is up
>>>>>> to them...
>>>>>
>>>>> Duh.  Sorry.  For some reason I was thinking the magic MSI-X
>>>>> interception was happening in the host kernel rather than in qemu.
>>>>>
>>>>>> I understand that there are different virtualization techniques at play
>>>>>> here, it just doesn't seem relevant.  In the case of (A), the user can
>>>>>> choose not to mmap the page overlapping the vector table even if the
>>>>>> kernel allows it.  The user can also choose to mmap that page, but not
>>>>>> use the portion overlapping the vector table.  QEMU already does this
>>>>>> by overlaying a MemoryRegion for vector table emulation.  We might even
>>>>>> be able to get away with mmaping that page and emulating the vector
>>>>>> table elsewhere, which seems like the only option for a 64k page ARM
>>>>>> system.  For (B), clearly it's just a nuisance that we can't currently
>>>>>> mmap this page, but I still don't see how the user allowing the kernel
>>>>>> to allow the user to mmap that page makes any sense.  I can't even
>>>>>> describe it without it sounding ridiculous.  Thanks,
>>>>>
>>>>> Right.  Rethinking..  it seems to me we should just completely remove
>>>>> the logic from the kernel banning mmap()s overlapping the MSI-X
>>>>> table.  All it does is poorly attempt to stop the user shooting
>>>>> themselves in the foot.
>>>>>
>>>>> Then we just need logic in qemu to avoid doing the overlapping memory
>>>>> region nonsense on a per-machine basis
>>>>
>>>>
>>>> So is there still any plan or we just ditch the feature? I am confused now.
>>>
>>> The plan is what I said above.  Remove the bogus check logic from the
>>> kernel, then solve within qemu, by not creating the MSI-X intercept
>>> region for pseries guests.
>>
>>
>> There were 2 proposals how to do that. Both included platform code to
>> decide whether to allow mapping or not  and  some transport to pass that
>> enablement flag from the plafform code to the VFIO-PCI driver, one was via
>> an IOMMU group flag, the other via a PCI bus flag. Neither was accepted so
>> reposting those won't make any progress, what do I miss here? Was there any
>> agreement on how to do this?
> 
> As we've discussed the filtering of mmap()able regions in the kernel
> is not a security concern.  All it accomplishes is poorly trying to
> stop userspace from shooting itself in the foot by directly mapping
> and accessing the MSI-X table instead of using the VFIO interfaces to
> set up MSI vectors.
> 
> We should simply remove - unconditionally - the checks in the kernel
> which prevent parts of the BARs from being mapped.
> 


Just like this? And Alex is ok with that?

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 9ed1ecf..106b825 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1146,22 +1146,6 @@ static int vfio_pci_mmap(void *device_data, struct
vm_area_struct *vma)
        if (req_start + req_len > phys_len)
                return -EINVAL;

-       if (!vdev->msix_mmap && index == vdev->msix_bar) {
-               /*
-                * Disallow mmaps overlapping the MSI-X table; users don't
-                * get to touch this directly.  We could find somewhere
-                * else to map the overlap, but page granularity is only
-                * a recommendation, not a requirement, so the user needs
-                * to know which bits are real.  Requiring them to mmap
-                * around the table makes that clear.
-                */
-
-               /* If neither entirely above nor below, then it overlaps */
-               if (!(req_start >= vdev->msix_offset + vdev->msix_size ||
-                     req_start + req_len <= vdev->msix_offset))
-                       return -EINVAL;
-       }




-- 
Alexey

Attachment:
signature.asc

Description: OpenPGP digital signature