Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Sun, 8 Apr 2012 23:35:48 +0300

On Sun, Apr 08, 2012 at 08:39:35PM +0200, Jan Kiszka wrote:
> On 2012-04-08 20:18, Michael S. Tsirkin wrote:
> > On Sun, Apr 08, 2012 at 07:37:57PM +0200, Jan Kiszka wrote:
> >> On 2012-04-08 18:08, Avi Kivity wrote:
> >>> On 04/08/2012 07:04 PM, Michael S. Tsirkin wrote:
> >>>> On Sun, Apr 08, 2012 at 06:50:27PM +0300, Avi Kivity wrote:
> >>>>> On 04/08/2012 06:46 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>
> >>>>>>>>> I'm thinking about this flow:
> >>>>>>>>>
> >>>>>>>>>   FLR the device
> >>>>>>>>>   for each emulated register
> >>>>>>>>>      read it from the hardware
> >>>>>>>>>      if different from emulated register:
> >>>>>>>>>         update the internal model (for example, disabling MSI in kvm if
> >>>>>>>>> needed)
> >>>>>>>>
> >>>>>>>> If we do it this way we get back the problem this patch
> >>>>>>>> is trying to solve: MSIX assigned while device
> >>>>>>>> memory is disabled would cause unsupported request errors.
> >>>>>>>
> >>>>>>> Why is that?  FLR would presumably disable MSI in the device, and this
> >>>>>>> line would disable it in kvm as well.
> >>>>>>
> >>>>>> The bug is that device memory is disabled (FLR would do that)
> >>>>>> while MSI is enabled in kvm. The fix is to
> >>>>>> disable MSI in kvm first.
> >>>>>
> >>>>> Yes, no need to repeat.  My question is whether my pseudo-code does the
> >>>>> same
> >>>>
> >>>> It doesn't seem to: FLR (disabling memory) is followed
> >>>> by MSI disable in kvm instead of the reverse.
> >>>
> >>> Ah, so the problem is the ordering?  I see.
> >>>
> >>>>> and whether or not it is better (when applied to all emulated
> >>>>> config space).
> >>>>
> >>>> I'm not sure.
> >>>> I would like to see an example of a register that you have
> >>>> in mind.
> >>>
> >>> I went over the PCI registers and saw none that would be affected.
> >>>
> >>>>>>
> >>>>>> Yes. I'm talking about things like enabling memory, setting up irq register,
> >>>>>> etc though. Most of this setup is done by bios.
> >>>>>
> >>>>> I see.  So should we have a pci_reset_function() variant that limits
> >>>>> itself to restoring just those bits?
> >>>>
> >>>> We only need kernel to restore whatever qemu emulates, but
> >>>> kernel doesn't know what that is.
> >>>> What kind of interface do you have in mind?
> >>>>
> >>>
> >>> The same as pci_reset_function(), but leaves MSI clear.
> >>>
> >>> I guess it's not worth it if the ordering problem is there.
> >>
> >> The core problem is not the ordering. The problem is that the kernel is
> >> susceptible to ordering mistakes of userspace.
> >> And that is because the
> >> kernel panics on PCI errors of devices that are in user hands - a
> >> critical kernel bug IMHO.
> > 
> > I'm not sure. The pci sysfs interface
> > is by design not secured against malicious users,
> > isn't it?
> 
> That's surely true for devices outside of IOMMU protection. But do we
> really have to give up when we encapsulate and isolate them that way?
> Provided we moderate access to the sysfs resources via libvirt or some
> other management service.

We don't have to give up but we'd have to build such an
interface: /config attribute is not it.

> > 
> >> Proper reset of MSI or even the whole PCI
> >> config space is another issue, but one the kernel should not worry about
> >> - still, it should be fixed (therefore this patch).
> >> But even if we disallowed userland to disable MMIO and PIO access to the
> >> device, we would not be able to exclude that there are secret channels
> >> in the device's interface having the same effect.
> > 
> > I'm not sure I agree here.  If there are secret channels to the device
> > that let it violate the PCI express spec, it can probably break the SRIOV
> > security model. And then you can do much more than just crash the host.
> 
> Maybe, but there are also other devices. And if a guest reprograms it
> (firmware update...) and makes it stop reacting on requests, we may get
> the same effect. That would also be some kind of a "secret channel".

Right. So it looks like SRIOV VF is the only type of device that
is safe to assign to a guest:

Presumably, SRIOV VFs don't let the driver program the firmware.
And I think SRIOV VFs don't have MMIO/PIO enable bits either,
and the BAR isn't programmable through the VF...

> > 
> >> So we likely need to
> >> enhance PCI error handling to catch and handle faults for certain
> >> devices differently - those we cannot trust to behave properly while
> >> they are under userland/guest control.
> >>
> >> Jan
> >>
> > 
> > 
> > I agree - forwarding errors to the guest would actually be very useful - but
> > I think we also need to analyse the problem carefully,
> > and prevent as many ways as we can for guest to cause trouble.
> 
> If possible, the protection should target userspace which would
> automatically include guests. Only if that is not feasible with
> reasonable effort, we have to rely on QEMU to save the host.

Defence in depth is best, right?

> > 
> > And there is another issue here: unsupported request errors
> > should not cause kernel panics IMO.
> > 
> > There's also the issue that qemu lets the guest control the MMIO/PIO
> > bits in the command register.
> > 
> > So there are multiple bugs.
> > 
> 
> Yep, that's true.
> 
> Jan
> 
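
To make the ordering point from earlier in the thread concrete, here is a
toy model, not kernel or qemu code. The class, its fields and both helper
functions are made up for illustration. It assumes, as stated above, that
FLR disables device memory and MSI in the device, and it records every step
that leaves device memory disabled while MSI is still enabled in kvm.

```python
from dataclasses import dataclass, field


@dataclass
class AssignedDevice:
    """Hypothetical model of an assigned device plus kvm's view of MSI."""
    memory_enabled: bool = True    # memory-space enable bit in the hardware
    hw_msi_enabled: bool = True    # MSI enable bit in the hardware
    kvm_msi_enabled: bool = True   # MSI still set up in kvm
    unsafe_steps: list = field(default_factory=list)

    def _check(self, step):
        # Hazardous state from the thread: device memory disabled
        # while MSI is still enabled in kvm.
        if self.kvm_msi_enabled and not self.memory_enabled:
            self.unsafe_steps.append(step)

    def flr(self):
        # Assumption taken from the thread: FLR clears both bits.
        self.memory_enabled = False
        self.hw_msi_enabled = False
        self._check("flr")

    def disable_msi_in_kvm(self):
        self.kvm_msi_enabled = False
        self._check("disable_msi_in_kvm")

    def sync_model_from_hardware(self):
        # The "read it back and update the internal model" flow.
        if self.kvm_msi_enabled != self.hw_msi_enabled:
            self.kvm_msi_enabled = self.hw_msi_enabled
        self._check("sync")


def reset_then_sync():
    """FLR first, then bring the emulated state in line with hardware."""
    dev = AssignedDevice()
    dev.flr()
    dev.sync_model_from_hardware()
    return dev


def disable_then_reset():
    """Disable MSI in kvm first, then FLR (the order the patch aims for)."""
    dev = AssignedDevice()
    dev.disable_msi_in_kvm()
    dev.flr()
    return dev
```

In this model both flows end with MSI disabled in kvm, but only
reset_then_sync() records an unsafe step, right after the FLR.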
