Re: KVM PCI device assignment issues

Matthew Wilcox <matthew@xxxxxx> · Fri, 13 Feb 2009 10:36:29 -0700

On Fri, Feb 13, 2009 at 04:32:47PM +0000, Mark McLoughlin wrote:
> Hi,

You raise some interesting points.  Thanks for doing that rather than
going off and creating a big pile of patches and demanding they be
applied ;-)

> This gets confusing, so some background constraints first:
> 
>   - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same 
>     bridge must be assigned to the same VT-d domain - i.e. given device
>     A (0000:0f:1.0) and device B (0000:0f:2.0), if you assign
>     device A to a guest, you cannot then use device B in the host or
>     another guest.

Is that a limitation of the VT-d / IOMMU setup?

>   - Some newer PCIe devices (and newer conventional PCI devices too via 
>     PCI Advanced Features) support Function Level Reset (FLR). This 
>     allows a PCI function to be reset without affecting any other 
>     functions on that device, or any other devices. This feature is not 
>     widespread yet AFAIK - e.g. I've seen it on an audio controller, 
>     and it must also be supported by SR-IOV devices.

Yes, that's definitely not very widespread yet.  OTOH, we don't need to
worry about disturbing other functions if all devices behind the same
bridge have to be mapped to the same guest.

>   - Secondary Bus Reset (SBR) allows software to trigger a reset on all 
>     devices (and functions) behind a PCI bridge.
> 
>   - A PCI Power Management D-state transition (D3hot to D0) can be used 
>     to reset a device (all functions).

That's not guaranteed according to PCI PM 1.2:

5.4.1. Software Accessible D3 (D3hot)

  When programmed to D0, the function may return to the D0 Initialized
  or D0 Uninitialized state without PCI RST# being asserted. This option
  is determined at design time and allows designs the option of either
  performing an internal reset or not performing an internal reset.

-----

There's also the option that devices in a hotplug PCI slot can have
their power cycled, forcing them into D3cold and then transitioning into
D0 Uninitialised.
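
As far as I know, PM 1.2 also defines a No_Soft_Reset bit (bit 3 of
PMCSR) that reports which of the two options a function took.  A rough
sketch of testing it from the shell; the function name is made up for
illustration, and a clear bit on a pre-1.2 device (where the bit is
reserved) tells you nothing either way:

```sh
# Hypothetical helper: given a PMCSR value, report whether the function
# claims to perform an internal reset on D3hot -> D0.
# Bit 3 (0x0008) is No_Soft_Reset; set means "no internal reset".
d3hot_resets() {
    if [ $(( $1 & 0x0008 )) -eq 0 ]; then
        echo "resets"
    else
        echo "no-soft-reset"
    fi
}

# Assumed usage on a live system (needs root and pciutils):
#   d3hot_resets "0x$(setpci -s 0000:00:19.0 CAP_PM+4.w)"
```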

>   - Some PCI devices don't have page aligned MMIO BARs. These devices 
>     (all functions) cannot be safely assigned to guests.

We've seen patches to force page alignment on this list ... they haven't
been sufficiently beautiful to be applied yet.
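
Spotting the offending devices from userspace is easy enough, at least.
A minimal sketch, assuming the three-column "start end flags" layout of
a device's sysfs resource file and a 4 KiB page size (the function name
is hypothetical):

```sh
# Reads "start end flags" lines on stdin and prints every memory region
# that does not both start and end on a page boundary.
# 0x200 is IORESOURCE_MEM; 4096 is an assumed page size.
unaligned_mmio() {
    while read -r start end flags; do
        [ -n "$flags" ] || continue                   # skip blank lines
        [ $(( $flags & 0x200 )) -ne 0 ] || continue   # memory BARs only
        if [ $(( $start % 4096 )) -ne 0 ] ||
           [ $(( ($end + 1) % 4096 )) -ne 0 ]; then
            echo "$start $end"
        fi
    done
}

# Assumed usage:
#   unaligned_mmio < /sys/bus/pci/devices/0000:00:19.0/resource
```

Any output means the device has at least one MMIO region that shares a
page with something else, or could do.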

> Driver Unbinding
> ================
> 
> Before a device is assigned to a guest, we should make sure that no host
> device driver is currently bound to the device.
> 
> We can do that with e.g.
> 
>  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
>  $> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
>  $> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> 
> One minor problem with this scheme is that at this point you can't
> unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> In order to support that, we need a "remove_id" interface to remove the
> dynamic ID.

It sounds like you'd be OK with a 'remove_id' interface that only
removes subsequently-added IDs.

I might suggest a second approach which would be to have an explicit
echo to the bind file ignore the list of ids.  Then you wouldn't need to
'echo -n "8086 10de"' to begin with.
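
For comparison, here is the current three-step sequence wrapped up as a
function.  This is only a sketch: the function name and the optional
sysfs-root argument (there so it can be dry-run against a scratch
directory) are my own invention, not an existing tool.

```sh
# Hypothetical wrapper around the three echo commands quoted above.
#   $1  "vendor device" ID pair     $2  device address
#   $3  currently bound driver      $4  sysfs root (default /sys)
stub_device() {
    drivers="${4:-/sys}/bus/pci/drivers"
    printf '%s' "$1" > "$drivers/pci-stub/new_id" &&
    printf '%s' "$2" > "$drivers/$3/unbind" &&
    printf '%s' "$2" > "$drivers/pci-stub/bind"
}

# Assumed usage (as root):
#   stub_device "8086 10de" 0000:00:19.0 e1000e
```

With a bind that ignores the ID list, the first write would go away.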

> Device Reset
> ============
> 
> Before assigning a device to a guest, it should be reset. The host or a
> previous guest may have left the device in an unknown state. Not
> resetting can be seen in testing to lead to e.g. "TX Unit Hang" errors
> with e1000e devices.

Really, this is the same problem that kexec has.  Either the driver is
doing insufficient initialisation, or it's not doing its shutdown
properly.  The former is definitely better than the latter as kexec may
be used from a position of having the driver locked solid and unable to
reset the device.

> If we're assigning devices from behind a PCI/PCI-X bridge (remember all
> devices must be assigned together), then we can use SBR to reset them
> all together. Clearly, though, one should make sure that all devices
> behind that bridge are not in use before doing the reset. We could
> implement this with a "reset" sysfs interface for pci-stub - it would
> only reset a device using SBR if all devices behind that bridge were
> bound to pci-stub.

I don't think this should be part of pci-stub, but rather part of the
PCI core.  I can imagine other uses for being able to reset all devices
behind a bridge that don't involve anything to do with v12n.  So I'd
like to see a /sys/class/pci_bus/*/reset (where * would not include root
busses).
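
The "is everything on this bus bound to pci-stub" check could then live
in userspace.  A rough sketch, assuming the usual layout where each
device directory has a "driver" symlink when something is bound; the
function name is hypothetical, and it only looks at the one bus, not at
anything behind further bridges:

```sh
# Hypothetical pre-check before a secondary bus reset: succeeds only if
# the bus has at least one device and all of them are bound to pci-stub.
#   $1  bus in domain:bus form (e.g. 0000:0f)
#   $2  sysfs root (default /sys)
bus_all_stubbed() {
    found=0
    for dev in "${2:-/sys}/bus/pci/devices/$1":*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        found=1
        # "driver" is a symlink to the bound driver; absent if unbound.
        [ -L "$dev/driver" ] || return 1
        drv=$(basename "$(readlink "$dev/driver")")
        [ "$drv" = "pci-stub" ] || return 1
    done
    [ "$found" -eq 1 ]
}

# Assumed usage:
#   bus_all_stubbed 0000:0f && echo "safe to reset"
```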

> Where a conventional PCI device is on the root bus, or where a PCIe
> device is on the root bus or another bus with multiple devices, we could
> use the D-state transition reset. Since this resets all functions on a
> device, we would need a similar approach where all functions must be
> bound to pci-stub before being reset.

Even with the caveat above that D0 -> D3hot -> D0 doesn't necessarily
do a full reset, it does seem to be per-function.  For example, this
passage from the SCSI LSI 1010-66 chip docs implies that very strongly:

  Power state D3 is a lower power level than power state D2. A function in
  this state places the LSI53C1010 core in the coma mode. Furthermore,
  the function's soft reset is continually asserted while in power state D3,
  which clears all pending interrupts and 3-states the SCSI bus. In
  addition, the function's PCI Command register is cleared. If both of the
  LSI53C1010 functions are placed in power state D3, the Clock
  Quadrupler is disabled, which results in additional power savings.

> Filtering
> =========
> 
> In order to support a sane user interface in management tools, it should
> be possible to list all PCI devices available on a host and filter
> out those which cannot be assigned to a guest.

I think this is going to have to have a large userspace component ...
let's see what needs to be done in-kernel.

> Furthermore, it should be possible to do this without actually affecting
> any of the devices - i.e. a "try to unbind and see if we oops" approach
> clearly isn't great.

Well, yes.  I'd even be upset if my network or storage flickered away
briefly while another user was starting to run KVM.

> This last constraint is the most difficult and points to the logic
> needing to be in userland management libraries. Possibly the only sane
> kernel space support would be "try to unbind and reset; if it works then
> the device is assignable".

If we expose a 'reset' file in the /sys/bus/pci/devices/*/ directories
for devices that are resettable, that should be enough, I would think.
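
The userspace side of the filter could then be about this simple.  A
sketch only: it assumes the attribute ends up being called "reset" as
suggested here, and the function name is made up.

```sh
# Lists the addresses of devices that expose a "reset" attribute.
#   $1  sysfs root (default /sys)
# On a kernel without the attribute this prints nothing.
resettable_devices() {
    for f in "${1:-/sys}"/bus/pci/devices/*/reset; do
        [ -e "$f" ] || continue            # glob matched nothing
        basename "$(dirname "$f")"
    done
}

# Assumed usage:
#   resettable_devices
```

Nothing is unbound or touched; it only looks at what the kernel exports.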

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."