On Wed, Nov 29, 2017 at 03:38:30PM +1100, Alexey Kardashevskiy wrote: > On 29/11/17 14:57, David Gibson wrote: > > On Wed, Nov 29, 2017 at 02:25:43PM +1100, Alexey Kardashevskiy wrote: > >> On 22/11/17 17:51, David Gibson wrote: > >>> On Tue, Nov 21, 2017 at 10:14:45PM -0700, Alex Williamson wrote: > >>>> On Wed, 22 Nov 2017 15:44:55 +1100 > >>>> David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote: > >>>> > >>>>> On Tue, Nov 21, 2017 at 09:28:46PM -0700, Alex Williamson wrote: > >>>>>> On Wed, 22 Nov 2017 15:09:32 +1100 > >>>>>> Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: > >>>>>> > >>>>>>> By default VFIO disables mapping of MSIX BAR to the userspace as > >>>>>>> the userspace may program it in a way allowing spurious interrupts; > >>>>>>> instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl. > >>>>>>> > >>>>>>> This works fine as long as the system page size equals to the MSIX > >>>>>>> alignment requirement which is 4KB. However with a bigger page size > >>>>>>> the existing code prohibits mapping non-MSIX parts of a page with MSIX > >>>>>>> structures so these parts have to be emulated via slow reads/writes on > >>>>>>> a VFIO device fd. If these emulated bits are accessed often, this has > >>>>>>> serious impact on performance. > >>>>>>> > >>>>>>> This adds an ioctl to the vfio-pci device which hides the sparse > >>>>>>> capability and allows the userspace to map a BAR with MSIX structures. > >>>>>> > >>>>>> So the user is in control of telling the kernel whether they're allowed > >>>>>> to mmap the msi-x vector table. That makes absolutely no sense. If > >>>>>> you're trying to figure out how userspace knows whether to implicitly > >>>>>> avoid mmap'ing the msix region, I think there are far better ways in > >>>>>> the existing region info ioctl. We could use a flag, or maybe the > >>>>>> existence of a capability chain pointer, or a new capability. But > >>>>>> absolutely not this. The kernel needs to decide whether it's going to > >>>>>> let the user do this, not the user. Thanks, > >>>>> > >>>>> No, it doesn't. This is actually the approach we discussed in Prague. > >>>>> > >>>>> Remember that intercepting access to the MSI-X table is not a host > >>>>> safety / security issue. It's just that without that we won't wire up > >>>>> the device's MSI-X vectors properly so they won't work. > >>>>> > >>>>> Basically the decision here is between > >>>>> > >>>>> A) Allow MSI-X configuration via standard PCI mechanisms, at the > >>>>> cost of making access slow for any registers sharing a page with > >>>>> the MSI-X table. > >>>>> > >>>>> or > >>>>> > >>>>> B) Make access to BAR registers sharing a page with the MSI-X table > >>>>> fast, at the cost of requiring some alternative mechanism to > >>>>> configure MSI-X vectors. > >>>>> > >>>>> And that is a tradeoff that it is reasonable for userspace to make. > >>>>> > >>>>> In the case of KVM guests, the decision depends entirely on the > >>>>> *guest* platform. Usually we need (A) because the guest expects to be > >>>>> able to poke the MSI-X table in the usual way. However for PAPR > >>>>> guests, there's an alternative mechanism via an RTAS call, which means > >>>>> we can use (B). > >>>>> > >>>>> The host kernel can't make this decision, because it doesn't know the > >>>>> guest platform (well, KVM might, but VFIO doesn't). > >>>>> > >>>>> A userspace VFIO program could also elect for (B) if it does care > >>>>> about performance of access to registers in the same BAR as the MSI-X > >>>>> table, but doesn't need MSI-X for example. > >>>> > >>>> You're asking for an ioctl to allow the kernel to allow the user to > >>>> mmap the page, when instead we could just allow the user to mmap the > >>>> page and whether the user does that and how they make use of it is up > >>>> to them... > >>> > >>> Duh. Sorry. For some reason I was thinking the magic MSI-X > >>> interception was happening in the host kernel rather than in qemu. > >>> > >>>> I understand that there are different virtualization techniques at play > >>>> here, it just doesn't seem relevant. In the case of (A), the user can > >>>> choose not to mmap the page overlapping the vector table even if the > >>>> kernel allows it. The user can also choose to mmap that page, but not > >>>> use the portion overlapping the vector table. QEMU already does this > >>>> by overlaying a MemoryRegion for vector table emulation. We might even > >>>> be able to get away with mmaping that page and emulating the vector > >>>> table elsewhere, which seems like the only option for a 64k page ARM > >>>> system. For (B), clearly it's just a nuisance that we can't currently > >>>> mmap this page, but I still don't see how the user allowing the kernel > >>>> to allow the user to mmap that page makes any sense. I can't even > >>>> describe it without it sounding ridiculous. Thanks, > >>> > >>> Right. Rethinking.. it seems to me we should just completely remove > >>> the logic from the kernel banning mmap()s overlapping the MSI-X > >>> table. All it does is poorly attempt to stop the user shooting > >>> themselves in the foot. > >>> > >>> Then we just need logic in qemu to avoid doing the overlapping memory > >>> region nonsense on a per-machine basis > >> > >> > >> So is there still any plan or we just ditch the feature? I am confused now. > > > > The plan is what I said above. Remove the bogus check logic from the > > kernel, then solve within qemu, by not creating the MSI-X intercept > > region for pseries guests. > > > There were 2 proposals how to do that. Both included platform code to > decide whether to allow mapping or not and some transport to pass that > enablement flag from the plafform code to the VFIO-PCI driver, one was via > an IOMMU group flag, the other via a PCI bus flag. Neither was accepted so > reposting those won't make any progress, what do I miss here? Was there any > agreement on how to do this? As we've discussed the filtering of mmap()able regions in the kernel is not a security concern. All it accomplishes is poorly trying to stop userspace from shooting itself in the foot by directly mapping and accessing the MSI-X table instead of using the VFIO interfaces to set up MSI vectors. We should simply remove - unconditionally - the checks in the kernel which prevent parts of the BARs from being mapped. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Attachment:
signature.asc
Description: PGP signature