Re: [PATCH] util: Add phys_port_name support on virPCIGetNetName

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 12/10/20 11:51 AM, Adrian Chiris wrote:

Hi,
Would be great to get a pair of eyes on this Patch,
Thanks!


I've looked at it several times and every time would just end up shaking my head wondering why there isn't one definitive symlink in the VF's sysfs for the netdev of the physical port. (Part of my lack of response from the last time is that I didn't know how to respond since I didn't understand why such roundabout logic should be needed to answer such a basic question, decided I should look at it again before responding, and then forgot about it :-()


Anyway, this time I'm determined to not get up until I've got it figured out (or at least understand exactly what I don't have figured out)...



-----Original Message-----
From: Dmytro Linkin <dlinkin@xxxxxxxxxx>
Sent: Tuesday, October 27, 2020 10:58 AM
To: libvir-list@xxxxxxxxxx
Cc: Laine Stump <laine@xxxxxxxxxx>; Adrian Chiris <adrianc@xxxxxxxxxx>;
Moshe Levi <moshele@xxxxxxxxxx>
Subject: Re: [PATCH] util: Add phys_port_name support on
virPCIGetNetName

On Mon, Sep 28, 2020 at 12:56:12PM +0300, Dmytro Linkin wrote:
On Tue, Sep 22, 2020 at 08:31:15AM -0400, Laine Stump wrote:
On 8/28/20 6:53 AM, Dmytro Linkin wrote:
Current virPCIGetNetName() logic is to get net device name by
checking it's phys_port_id, if caller provide it, or by it's index
(eg, by it's position at sysfs net directory). This approach worked
fine up until linux kernel version 5.8, where NVIDIA Mellanox
driver implemented linking of VFs' representors to PCI device in
switchdev mode. This mean that device's sysfs net directory will hold
multiple net devices. Ex.:
$ ls '/sys/bus/pci/devices/0000:82:00.0/net'
ens1f0  eth0  eth1

Most switch devices support phys_port_name instead of phys_port_id,
so
virPCIGetNetName() will try to get PF name by it's index - 0. The
problem here is that the PF nedev entry may not be the first.

To fix that, for switch devices, we introduce a new logic to select
the PF uplink netdev according to the content of phys_port_name.
Extend
virPCIGetNetName() with physPortNameRegex variable to get proper
device by it's phys_port_name scheme, for ex., "p[0-9]+$" to get
PF, "pf[0-9]+vf[0-9]+$" to get VF or "p1$" to get exact net device.
So now
virPCIGetNetName() logic work in following sequence:
  - filter by phys_port_id, if it's provided,
  or
  - filter by phys_port_name, if it's regex provided,
  or
  - get net device by it's index (position) in sysfs net directory.
Also, make getting content of iface sysfs files more generic.

Signed-off-by: Dmytro Linkin <dlinkin@xxxxxxxxxx>
Reviewed-by: Adrian Chiris <adrianc@xxxxxxxxxx>

[...]


+/* Represents format of PF's phys_port_name in switchdev mode:
+ * 'p%u' or 'p%us%u'. New line checked since value is readed from sysfs
file.
+ */
+# define VIR_PF_PHYS_PORT_NAME_REGEX ((char
+*)"(p[0-9]+$)|(p[0-9]+s[0-9]+$)")
+

I've come back to look at this patch several times since it was
posted (sorry for the extreme delay in responding), but just can't
figure out what it's doing with this regex and why the regex is
necessary. Not having access to the hardware that it works with is a
bit of a problem, but perhaps I could get a better idea if you gave
a full example of sysfs contents? My concern with using a regex is
that it might work just fine when using one method for net device
naming, but break if that was changed. Also, it seems
counterintuitive that it would be necessary to look for a device
with a name matching a specific pattern; why isn't there simply a
single symbolic link somewhere in the sysfs tree for the net device
that just directly points at its physical port? That would be so
much simpler and more reliable (or at least it would give the
*perception* of being more reliable).

I'll emphasize that regex is being used for phys_port_name, NOT net
device name (they are not the same). Anyway, I'll give an example.

Lets look how virPCIGetNetName() currently work.
At first it try to read phys_port_id of every net device in net
folder, like:
$ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/'
$ cat net/ens1f0/phys_port_id

Imagine we have single pci dual port nic (I hope named it right), eg.
net folder holds net devices for both ports, so read operation will be
successfull and function will return name of first or second port.
For single port or dual pci nics (or for drivers which didn't
implemented phys_port_id) read fails and function fallback to picking
up device by it's index, which not really fine (I'll describe it
later) but work 'cause net folder usualy contains only one net device.

But kernel patch brought third case which not work.

Here are ifaces of ConnectX-5 NIC:
$ ls -l /sys/class/net | cut -d ' ' -f 9-
ens1f0 ->
../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0
ens1f1 ->
../../devices/pci0000:80/0000:80:02.0/0000:82:00.1/net/ens1f1
ens1f2 ->
../../devices/pci0000:80/0000:80:02.0/0000:82:00.2/net/ens1f2
ens1f3 ->
../../devices/pci0000:80/0000:80:02.0/0000:82:00.3/net/ens1f3
ens1f0_0 ->
../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_0
ens1f0_1 ->
../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_1

ens1f0 & ens1f1 - PFs;
ens1f2 & ens1f3 - VFs created on 1st PF and..
ens1f0_0 & ens1f0_1 - corresponding representors.

Here is content of PFs' pci net folders (for comparison):
$ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net'
ens1f0  ens1f0_0  ens1f0_1
$ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.1/net'
ens1f1

When virPCIGetNetName() called for 2nd PF, for ex., which is in legacy
mode, phys_port_id read fails (Operation not supported), function
falling back to index and return name of first and only net device.
Fine here.
When function called for 1st PF in switchdev mode, sequence is the
same and first net device name is returned. The problem here is that
PF could be not the first device, because order is not persistent and
defined by system.

So approach is to find PF by reading phys_port_name. Ex.:
$ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net/'
$ cat ens1f0/phys_port_name
p0
$ cat ens1f0_0/phys_port_name
pf0vf0
$ cat ens1f0_1/phys_port_name
pf0vf1


And here regex is used. We really don't care about exact value of
phys_port_name, 'cause there only one PF net device. And regex also
cover smart nics, which use other phys_port_name scheme.


Okay, in the last 4 hours I had typed *a lot* of stuff here, and thought I was starting to understand how everything worked. But in the meantime I was also exchanging email with Marcelo Leitner about the issue, and he happened to mention that ConnectX-5 cards no longer have multiple PF netdevs on a single PCI device.


I *had been* operating under the impression that multiple PFs on a single PCI device was still the case, and that the entire reason for needing to look for the proper PF netdev was to pick between one of multiple PFs on the single PCI device. I had written up a whole huge wall of text explaining how the method being used in this patch must be only partially complete, because it wasn't matching the phys_port_name of the VF netdev to the phys_port_name of the proper PF netdev.


But then after the email with Marcelo, his offhand statement slowly crept out of my unconscious, prompted along by my realization that your regexp isn't matching the devices that look to be different for different VFs, e.g. pf0vf1, it is matching the plain "p0". After that, Marcelo provided me with access to a machine with a ConnectX-5 card, where I confirmed for myself that there is only a single PF on any PCI device, and that the "p0"-named device is the PF. *Finally* it makes sense why your patch doesn't try to find a device that is matched to the particular VF in any way, but just looks for the device that has a fixed pattern for phys_port_name - it's because any PCI device will only have a single PF netdev, and we just have to figure out which one that is.


Okay, so now that I finally understand, and have deleted the original wall of text.


A few things about the patch:


1) the new utility function that you added looks very useful, but should be added in a separate patch.


2) the patches don't apply cleanly any more.


3) since the regexp is fixed, I don't think it needs to be sent as an argument to the function - it can just be referenced directly when needed.


4) I think it should be explained in the code that in the case of using phys_port_name, there is only a single netdev on the PCI device that can possibly be the PF, whereas in the case of using phys_port_id, there could be multiple PFs, and so we have to match the phys_port_id of the VF (having that bit of info would have saved me lots of agony :-)


If you like I can try splitting the existing patch and fixing the merge conflicts, and put the results on gitlab or something for you to try out. I feel like I should do *something*, after making you wait for so long :-/





[Index of Archives]     [Virt Tools]     [Libvirt Users]     [Lib OS Info]     [Fedora Users]     [Fedora Desktop]     [Fedora SELinux]     [Big List of Linux Books]     [Yosemite News]     [KDE Users]     [Fedora Tools]

  Powered by Linux