Hello,
Thank you for getting back to me so soon. I switched to Thunderbird for
better text clarity. See some comments inline:
On 21.03.2018 19:54, Laine Stump wrote:
On 03/21/2018 11:46 AM, Ciprian Barbu wrote:
Hello,
In the context of running Openstack on a cluster of Cavium ThunderX cn8890 aarch64 servers, we are trying to attach virtual functions to a VM.
First some introduction. This Cavium SoC takes a different approach to Virtual Functions than x86 NICs do: VFs are always enabled, and there are two types of VFs and *one single* PF, as follows:
- primary VFs - these are in fact assigned by the system to the physical ports of the server, e.g. em2p1s0f1, em2p1s0f3 etc. below.
- secondary VFs - the main purpose of these is to provide additional HW queues under SW control (usually DPDK applications) by automatically binding them to the needed physical port.
- one single "physical" function, device 0002:01:00.0 below, which to the best of my knowledge acts merely as a stub and cannot be assigned an interface name.
Below is the output of "dpdk-devbind.py -s" which provides some useful information.
Network devices using DPDK-compatible driver
============================================
0002:01:00.2 'Device a034' drv=vfio-pci unused=nicvf
Network devices using kernel driver
===================================
0000:01:10.0 'THUNDERX BGX (Common Ethernet Interface)' if= drv=thunder-BGX unused=thunder_bgx,vfio-pci
0000:01:10.1 'THUNDERX BGX (Common Ethernet Interface)' if= drv=thunder-BGX unused=thunder_bgx,vfio-pci
0002:01:00.0 'THUNDERX Network Interface Controller' if= drv=thunder-nic unused=nicpf,vfio-pci
0002:01:00.1 'Device a034' if=em2p1s0f1 drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.3 'Device a034' if=em2p1s0f3 drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.4 'Device a034' if=em2p1s0f4 drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.5 'Device a034' if=em2p1s0f5 drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.6 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.7 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:01.0 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
Now for the problem. I don't have a domain definition, because libvirt fails to start the domain, but I might be able to find what nova generates. What it tries to do is pass through em2p1s0f3, at address 0002:01:00.3:
<interface type='hostdev' managed='yes'>
<source>
<address type='pci' domain='0x0002' bus='0x1' slot='0x0' function='0x3'/>
</source>
</interface>
I see that while I was typing my own "really long" message, that Alex
pointed out in a response that you could use <hostdev> rather than
<interface type='hostdev'> if you don't need to configure the MAC
address or vlan tag of the VF from within libvirt. If that's the case,
you can ignore the rest of my message, but otherwise read on :-)
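For reference, here is a minimal sketch of the two XML forms being
compared, generated with Python's standard xml.etree module. The helper
names (pci_source, interface_hostdev, plain_hostdev) are made up for
this illustration and are not part of libvirt or nova. The <hostdev>
form shown is the generic PCI passthrough element, which does not ask
libvirt to set a MAC address or vlan tag.

```python
import xml.etree.ElementTree as ET


def pci_source(domain, bus, slot, function, with_type):
    """Build a <source><address .../></source> element (hypothetical helper)."""
    source = ET.Element("source")
    attrs = {}
    if with_type:
        # <interface type='hostdev'> carries type='pci' on the address;
        # plain <hostdev> carries the type on the <hostdev> element instead.
        attrs["type"] = "pci"
    attrs.update(
        domain=f"0x{domain:04x}",
        bus=f"0x{bus:02x}",
        slot=f"0x{slot:02x}",
        function=f"0x{function:x}",
    )
    ET.SubElement(source, "address", attrs)
    return source


def interface_hostdev(domain, bus, slot, function):
    """The form nova generates here: libvirt handles the VF as a network device."""
    iface = ET.Element("interface", {"type": "hostdev", "managed": "yes"})
    iface.append(pci_source(domain, bus, slot, function, with_type=True))
    return iface


def plain_hostdev(domain, bus, slot, function):
    """Generic PCI passthrough: no MAC or vlan configuration by libvirt."""
    dev = ET.Element(
        "hostdev", {"mode": "subsystem", "type": "pci", "managed": "yes"}
    )
    dev.append(pci_source(domain, bus, slot, function, with_type=False))
    return dev


if __name__ == "__main__":
    # 0002:01:00.3 is the address of em2p1s0f3 from the listing above.
    print(ET.tostring(interface_hostdev(0x0002, 0x01, 0x00, 0x3), encoding="unicode"))
    print(ET.tostring(plain_hostdev(0x0002, 0x01, 0x00, 0x3), encoding="unicode"))
```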
Due to some Openstack technicalities, it's not possible or reliable to
do so on this SoC. There are 2 ways to achieve this, if you are
interested to read:
1. "blind" PCI passthrough [1], where it's possible to request any
number of PCI devices of a certain vendor_id:product_id. You cannot
specify which PCI bus address, so it's not flexible.
2. using direct-physical bound ports, no good documentation except for
[2]. This doesn't work for Cavium ThunderX because the interfaces are
*always* Virtual Functions.
I will test your suggestion directly through libvirt, though. I usually
don't start VMs manually, since Openstack does that.
You can find attached a trimmed libvirtd.log where the main error is:
43236: error : virPCIGetVirtualFunctionInfo:2927 : internal error: The PF device for VF /sys/bus/pci/devices/0002:01:00.3 has no network device name
I have actually spent a few days trying to do some hacks and learn some more. The main idea is that virPCIGetVirtualFunctionInfo fails to find the physical name for the virtual device at address 0002:01:00.3, which as I explained in the introduction is something that this Cavium SoC does not do.
Looking further down the stream, almost all of the helper functions need a linkdev for the physical function, so making libvirt work on this system would take some heavy refactoring. One solution would be to use the sysfs path rather than the interface name.
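To make the failing lookup concrete, here is a small sketch of the sysfs
walk involved: follow the VF's "physfn" link to the PF, then look for a
netdev name under the PF's "net" directory. This is an illustration
written for this mail, not libvirt's actual code; the function name
pf_netdev_for_vf and the sysfs_root parameter are assumptions made so
the logic can be tried against a fake directory tree.

```python
import os


def pf_netdev_for_vf(vf_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return (pf_address, pf_netdev_name or None) for a VF PCI address.

    Hypothetical helper: it only mirrors the general idea of resolving
    a VF to its PF and then to the PF's network device name.
    """
    vf_dir = os.path.join(sysfs_root, vf_addr)
    physfn = os.path.join(vf_dir, "physfn")
    if not os.path.islink(physfn):
        raise LookupError(f"{vf_addr} has no physfn link, so it is not a VF")

    # physfn is a symlink to the PF's device directory.
    pf_dir = os.path.realpath(physfn)

    # A PF bound to a netdev driver exposes its interface name(s) under net/.
    net_dir = os.path.join(pf_dir, "net")
    names = sorted(os.listdir(net_dir)) if os.path.isdir(net_dir) else []

    return os.path.basename(pf_dir), (names[0] if names else None)
```

On the ThunderX host I would expect this to return the PF address with
None for the name when called with "0002:01:00.3", which corresponds to
the "has no network device name" error in the log.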
The PF netdev name is needed because the netlink messages to get/set the
VF MAC address and vlan tag are sent to the PF netdev. A message to set
the MAC and vlan tag for VF 2 of PF "enpblah" would be something like this:
RTM_SETLINK/NLM_F_REQUEST-------+
| ifindex=-1                    |
| family=AF_UNSPEC              |
| IFLA_IFNAME------------------+|
| | enpblah                    ||
| +----------------------------+|
| IFLA_VFINFO_LIST-------------+|
| | IFLA_VFINFO---------------+||
| | | IFLA_VF_MAC------------+|||
| | | | vf=2                 ||||
| | | | mac=de:ad:be:ef:c0:55||||
| | | +----------------------+|||
| | | IFLA_VF_VLAN-----------+|||
| | | | vf=2                 ||||
| | | | vlanid=42            ||||
| | | +----------------------+|||
| | +-------------------------+||
| +----------------------------+|
+-------------------------------+
I *think* (although I can't say for certain since the original code was
written by someone else, and I've never tried it the other way) that we
could achieve the same result by filling in ifindex with the index of
"enpblah" (instead of -1), then leaving out the IFLA_IFNAME attribute,
but I haven't found any way of specifying the target of a netlink
message other than with its ifindex or its ifname.
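As a rough sketch of the two addressing options, the following Python
builds the raw bytes of such a request, either with IFLA_IFNAME and
ifindex=-1 or with a real ifindex and no IFLA_IFNAME. It only packs the
message and sends nothing. The numeric constants and struct layouts are
taken from the Linux uapi headers (the per-VF container is named
IFLA_VF_INFO there); the function names nlattr and vf_setlink and the
example ifindex 7 are made up for this illustration. Whether a given
kernel or driver accepts the ifindex-only variant is an assumption that
would still need to be tested.

```python
import struct

# Constants from linux/rtnetlink.h, linux/netlink.h and linux/if_link.h.
RTM_SETLINK = 19
NLM_F_REQUEST = 0x1
AF_UNSPEC = 0
IFLA_IFNAME = 3
IFLA_VFINFO_LIST = 22
IFLA_VF_INFO = 1
IFLA_VF_MAC = 1
IFLA_VF_VLAN = 2


def nlattr(attr_type, payload):
    """Pack one netlink attribute: 4-byte header, payload, pad to 4 bytes."""
    length = 4 + len(payload)
    pad = (-length) % 4
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def vf_setlink(vf, mac, vlan, ifname=None, ifindex=-1, seq=1):
    """Build an RTM_SETLINK request that sets a VF's MAC and vlan tag.

    The PF is addressed by ifname (with ifindex=-1) or by ifindex alone.
    """
    mac_bytes = bytes(int(part, 16) for part in mac.split(":"))
    # struct ifla_vf_mac { __u32 vf; __u8 mac[32]; }
    vf_mac = nlattr(IFLA_VF_MAC, struct.pack("=I32s", vf, mac_bytes))
    # struct ifla_vf_vlan { __u32 vf; __u32 vlan; __u32 qos; }
    vf_vlan = nlattr(IFLA_VF_VLAN, struct.pack("=III", vf, vlan, 0))
    vfinfo_list = nlattr(IFLA_VFINFO_LIST, nlattr(IFLA_VF_INFO, vf_mac + vf_vlan))

    # struct ifinfomsg: family, pad, type, index, flags, change
    ifinfo = struct.pack("=BxHiII", AF_UNSPEC, 0, ifindex, 0, 0)

    attrs = b""
    if ifname is not None:
        attrs += nlattr(IFLA_IFNAME, ifname.encode() + b"\0")
    attrs += vfinfo_list

    body = ifinfo + attrs
    # struct nlmsghdr: len, type, flags, seq, pid
    header = struct.pack("=IHHII", 16 + len(body), RTM_SETLINK, NLM_F_REQUEST, seq, 0)
    return header + body


if __name__ == "__main__":
    by_name = vf_setlink(2, "de:ad:be:ef:c0:55", 42, ifname="enpblah")
    by_index = vf_setlink(2, "de:ad:be:ef:c0:55", 42, ifindex=7)
    print(by_name.hex())
    print(by_index.hex())
```

Either way the request still has to be addressed to some netdev, which
is exactly what the ThunderX PF does not have.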
When you say "use the sysfs path", what exactly do you mean? Is there a
way to save/set the VF MAC addresses and vlan tags via sysfs? Or
(better) a way to address the netlink message to the PF if it has no
netdev name or ifindex? Maybe the drivers are set up so that an
RTM_SETLINK request sent to a "primary VF" would be able to get/set
VF_INFO for "Secondary VFs" associated with the same PF? I'm just
pulling ideas out of thin air here...
What I meant was that for functions like virNetDevGetVirtualFunctionIndex,
or just virNetDevSaveNetConfig, which require the physical linkdev name,
it should be possible to pass the sysfs path instead. But looking again
at virHostdevPreparePCIDevices, it looks like there are many places where
netlink is used. So forget about this idea, it doesn't look that feasible.
This will not work 100% from what I've seen, at least virNetDevGetVfConfig uses netlink to save the admin MAC (part of virNetDevSaveNetConfig), and netlink needs the ifname.
So I'm quite stuck on finding a workaround/fix for this platform which could potentially be upstreamed, so that we, ENEA, are not burdened with maintaining an ugly hack. Right now we are using libvirt 3.5.0, but we can upgrade to something newer if needed.
The questions, thus, are:
1. is this problem known in the libvirt community?
This is the first time I've heard of an SRIOV network device where the
PF wasn't bound to a netdev driver and so had no netdev name or ifindex.
I guess this is describing the card you're talking about?
https://dpdk.org/doc/guides/nics/thunderx.html
Yes, this is kind of the only public documentation about ThunderX NICs.
But do note that the interfaces are integrated on the motherboard. This
networking SoC has many HW accelerators and assignable HW resources: HW
queues, VFs, buffer management etc. All these blocks are connected to
the SoC via PCIe, not through slots but integrated directly on the
motherboard. See this for example [3]. There is more documentation
available on request through support accounts, I think.
I have to say that it does *not* give me the warm fuzzies that it
apparently requires setting
/sys/module/vfio/parameters/enable_unsafe_noiommu_mode=1 in order to
work (or did I misunderstand that part).
It's needed inside the VM at least, to be able to assign vfio-pci to the
device, which is needed if you want to run a DPDK application in the
guest, on the passed-through interface. The same might be needed on the
host, but I'm not sure. And yes, it looks a bit scary. There
is probably a good explanation for needing this.
2. Is there any plan to make it work?
If the hardware exists, and if users need to be able to set each VF's
MAC address and vlan tag via libvirt config, then we (the royal Open
Source "we" :-) need to make it work somehow.
I was hoping for more awareness of this problem, since ThunderX has been
available for some time. Our use case with OPNFV/Openstack is just one of
many possible ones, and in it we don't control what libvirt does, at
least not directly.
Probably others will pass the device as a hostdev like you and Alex
suggested.
Since you mentioned this option, we might be able to hack Openstack Nova
to treat these particular devices as PFs, although they look like VFs in
the system, but we might be opening another can of worms this way.
3. Can you give some pointers on an approach to adapt libvirt to this system?
4. Maybe it's worth changing the kernel to assign a sort of dummy interface to the physical function?
If there is no other way to address a netlink message to the PF telling
it to set the MAC address and vlan tag of a VF, then that may be needed.
If it can be saved/set in some other *standard* way, then perhaps
libvirt can grow support for it.
I guess this will come naturally if some critical mass of users is achieved.
Hacking the kernel to show a dummy interface might not work: there is
one single PF for all VFs, so one MAC address only.
Thanks and sorry for the long email,
Long emails with actual information are always preferable to an endless
chain of short mails that reveal the situation in tiny bits and pieces :-)
Great, I hope it will also be productive. I hope to find some nice
workaround, but I still found it useful to point out this problem and
see what is the general consensus on what to do.
[1] https://trickycloud.wordpress.com/2016/03/28/openstack-for-nfv-applications-sr-iov-and-pci-passthrough/
[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/networking_guide/sr-iov-support-for-virtual-networking
[3] https://www.avantek.co.uk/store/avantek-96-core-cavium-thunderx-arm-server-r270-t61.html
BR,
/Ciprian