Re: [libvirt] SRIOV configuration

Edward and I have had a multi-day private conversation on IRC about the topic of this mail. I was planning to update this thread with an email, but forgot until now :-/


On 9/24/20 10:54 AM, Daniel P. Berrangé wrote:
> On Mon, Sep 21, 2020 at 06:04:36PM +0300, Edward Haas wrote:
>> The PCI addresses appearing in the domxml are not the same as the ones
>> mapped/detected in the VM itself. I compared the domxml on the host
>> and the lspci output in the VM while the VM was running.
>
> Can you clarify what you are comparing here?
>
> The PCI slot / function in the libvirt XML should match, but the "bus"
> number in the libvirt XML is just an index referencing the <controller>
> element in the libvirt XML. So the "bus" number won't directly match
> what's reported in the guest OS. If you want to correlate, you need
> to look at the <address> on the <controller> to translate the libvirt
> "bus" number.


Right. The bus number that is visible in the guest is 100% controlled by the device firmware (and possibly the guest OS?). There is no way for qemu to explicitly set it, and thus no way for libvirt to guarantee that the bus number in the libvirt XML will be what is seen in the guest OS; the bus number in the XML only has meaning within the XML. You can find which controller a device is connected to by looking for the PCI controller that has the same "index" as the device's "bus".
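
To make the correlation concrete, here is a minimal sketch (all addresses below are invented for illustration): the <controller> whose "index" matches the device's "bus" attribute has its own <address>, and that is what actually places the controller, and therefore the device, in the guest's PCI topology:

```
  <controller type='pci' index='6' model='pcie-root-port'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
  </controller>
  ...
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <!-- bus='0x06' refers to the controller with index='6' above,
         not to bus 6 as enumerated by the guest OS -->
    <address type='pci' domain='0x0000' bus='0x06' slot='0x01' function='0x0'/>
  </hostdev>
```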



>> This occurs only when SRIOV is defined, messing up also the other
>> "regular" vnics.
>> Somehow, everything comes up and runs (with the SRIOV interface as
>> well) on the first boot (even though the PCI addresses are not in
>> sync), but additional boots cause the VM to mess up the interfaces
>> (not all are detected).


Actually we looked at this offline, and the "messing up" that's occurring is not due to any change in PCI address from one boot to the next. The entire problem is caused by the guest OS using traditional "eth0" and "eth1" netdev names, and making the incorrect assumption that those names are stable from one boot to the next. In fact, it is a long-known problem that, due to a race between kernel code initializing devices and user processes giving them names, the ordering of ethN device names can change from one boot to the next, *even with completely identical hardware and no configuration changes*. Here is a good description of that problem, and of systemd's solution to it ("predictable network device names"):


https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/


Edward's inquiry was initiated by this bugzilla:


  https://bugzilla.redhat.com/show_bug.cgi?id=1874096


You can see in the "first boot" and "second boot" ifconfig output that one ethernet device has the altname enp2s1 during both runs, and another consistently has the altname enp3s0; these names are assigned by systemd's "predictable network device name" algorithm (which bases the netdev name on the PCI address of the device). But the race between kernel and userspace causes the "ethN" names to be assigned differently from one boot to the next.


In order to have predictable netdev names, the OS image needs to stop setting net.ifnames=0 on the kernel command line. If they like, they can give their own more descriptive names to the devices (methods are described in the systemd document above; a sketch of one follows below), but they need to stop relying on ethN device names.
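
As an illustration of one such method (the MAC address, file name, and device name here are invented, not taken from the bug report), a systemd .link file can pin a stable, descriptive name to a device by matching on an attribute that doesn't change across boots, such as its MAC address:

```
# /etc/systemd/network/10-sriov-vf.link (hypothetical file name)
[Match]
MACAddress=52:54:00:6d:90:02

[Link]
Name=sriov0
```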


(Note that this experience did uncover another bug in libvirt, which *might* contribute to the racy code flip-flopping from boot to boot, but still isn't the root cause of the problem. In this case libvirtd is running privileged, but inside a container, and the container doesn't have full access to the devices' PCI config data in sysfs; when you run "lspci -v" inside the container, you'll notice "Capabilities: <access denied>". One result of this is that libvirt mistakenly determines the VF is a conventional PCI device (not PCIe), so it auto-adds a pcie-to-pci-bridge and plugs the VF into that controller. I'm guessing that makes device initialization take slightly longer or something, changing the results of the race. I'm looking into changing the test for PCIe vs. conventional PCI, but again, that isn't the real problem here.)


>> This is how the domxml hostdev section looks:
>> ```
>>      <hostdev mode='subsystem' type='pci' managed='yes'>
>>        <driver name='vfio'/>
>>        <source>
>>          <address domain='0x0000' bus='0x3b' slot='0x0a' function='0x4'/>
>>        </source>
>>        <alias name='hostdev0'/>
>>        <address type='pci' domain='0x0000' bus='0x06' slot='0x01' function='0x0'/>
>>      </hostdev>
>> ```
>>
>> Is there something we are missing, or have we misconfigured something?
>> Tested with 6.0.0-16.fc31
>>
>> My second question is: can libvirt avoid accessing the PF (as we do
>> not need the mac and other options)?
>
> I'm not sure, probably a question for Laine.


The entire point of <interface type='hostdev'> is to be able to set the MAC address (and optionally the vlan tag) of a VF when assigning it to a guest, and the only way to set those is via the PF. If you use plain <hostdev>, then libvirt has no idea that the device is a VF, so it doesn't look for or try to access its PF.
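
For contrast, here is a rough sketch of the <interface type='hostdev'> variant (the MAC and VLAN values are invented); with this config, libvirt programs the MAC address and VLAN tag through the PF before handing the VF to the guest:

```
    <interface type='hostdev' managed='yes'>
      <mac address='52:54:00:6d:90:02'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x3b' slot='0x0a' function='0x4'/>
      </source>
      <vlan>
        <tag id='42'/>
      </vlan>
    </interface>
```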


So, you're doing the right thing - since your container has no access to the PF, you need to set the MAC address / vlan tag outside the container (via the PF), and then use <hostdev> (which doesn't do anything related to PF devices).
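
In case it's useful, setting the VF's MAC address / vlan tag from outside the container is typically done on the host with iproute2, via the PF (the PF name and VF number below are hypothetical placeholders for whatever maps to the 0000:3b:0a.4 VF in your setup):

```
# run on the host, outside the container
ip link set enp59s0f0 vf 2 mac 52:54:00:6d:90:02 vlan 42
```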





