On Tue, Sep 22, 2020 at 08:31:15AM -0400, Laine Stump wrote: > On 8/28/20 6:53 AM, Dmytro Linkin wrote: > >Current virPCIGetNetName() logic is to get net device name by checking > >it's phys_port_id, if caller provide it, or by it's index (eg, by it's > >position at sysfs net directory). This approach worked fine up until > >linux kernel version 5.8, where NVIDIA Mellanox driver implemented > >linking of VFs' representors to PCI device in switchdev mode. This mean > >that device's sysfs net directory will hold multiple net devices. Ex.: > > > >$ ls '/sys/bus/pci/devices/0000:82:00.0/net' > >ens1f0 eth0 eth1 > > > >Most switch devices support phys_port_name instead of phys_port_id, so > >virPCIGetNetName() will try to get PF name by it's index - 0. The > >problem here is that the PF nedev entry may not be the first. > > > >To fix that, for switch devices, we introduce a new logic to select the > >PF uplink netdev according to the content of phys_port_name. Extend > >virPCIGetNetName() with physPortNameRegex variable to get proper device > >by it's phys_port_name scheme, for ex., "p[0-9]+$" to get PF, > >"pf[0-9]+vf[0-9]+$" to get VF or "p1$" to get exact net device. So now > >virPCIGetNetName() logic work in following sequence: > > - filter by phys_port_id, if it's provided, > > or > > - filter by phys_port_name, if it's regex provided, > > or > > - get net device by it's index (position) in sysfs net directory. > >Also, make getting content of iface sysfs files more generic. > > > >Signed-off-by: Dmytro Linkin <dlinkin@xxxxxxxxxx> > >Reviewed-by: Adrian Chiris <adrianc@xxxxxxxxxx> > > > [...] > > > > > >+/* Represents format of PF's phys_port_name in switchdev mode: > >+ * 'p%u' or 'p%us%u'. New line checked since value is readed from sysfs file. > >+ */ > >+# define VIR_PF_PHYS_PORT_NAME_REGEX ((char *)"(p[0-9]+$)|(p[0-9]+s[0-9]+$)") > >+ > > > I've come back to look at this patch several times since it was > posted (sorry for the extreme delay in responding), but just can't > figure out what it's doing with this regex and why the regex is > necessary. Not having access to the hardware that it works with is a > bit of a problem, but perhaps I could get a better idea if you gave > a full example of sysfs contents? My concern with using a regex is > that it might work just fine when using one method for net device > naming, but break if that was changed. Also, it seems > counterintuitive that it would be necessary to look for a device > with a name matching a specific pattern; why isn't there simply a > single symbolic link somewhere in the sysfs tree for the net device > that just directly points at its physical port? That would be so > much simpler and more reliable (or at least it would give the > *perception* of being more reliable). > I'll emphasize that regex is being used for phys_port_name, NOT net device name (they are not the same). Anyway, I'll give an example. Lets look how virPCIGetNetName() currently work. At first it try to read phys_port_id of every net device in net folder, like: $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/' $ cat net/ens1f0/phys_port_id Imagine we have single pci dual port nic (I hope named it right), eg. net folder holds net devices for both ports, so read operation will be successfull and function will return name of first or second port. For single port or dual pci nics (or for drivers which didn't implemented phys_port_id) read fails and function fallback to picking up device by it's index, which not really fine (I'll describe it later) but work 'cause net folder usualy contains only one net device. But kernel patch brought third case which not work. Here are ifaces of ConnectX-5 NIC: $ ls -l /sys/class/net | cut -d ' ' -f 9- ens1f0 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0 ens1f1 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.1/net/ens1f1 ens1f2 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.2/net/ens1f2 ens1f3 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.3/net/ens1f3 ens1f0_0 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_0 ens1f0_1 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_1 ens1f0 & ens1f1 - PFs; ens1f2 & ens1f3 - VFs created on 1st PF and.. ens1f0_0 & ens1f0_1 - corresponding representors. Here is content of PFs' pci net folders (for comparison): $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net' ens1f0 ens1f0_0 ens1f0_1 $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.1/net' ens1f1 When virPCIGetNetName() called for 2nd PF, for ex., which is in legacy mode, phys_port_id read fails (Operation not supported), function falling back to index and return name of first and only net device. Fine here. When function called for 1st PF in switchdev mode, sequence is the same and first net device name is returned. The problem here is that PF could be not the first device, because order is not persistent and defined by system. So approach is to find PF by reading phys_port_name. Ex.: $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net/' $ cat ens1f0/phys_port_name p0 $ cat ens1f0_0/phys_port_name pf0vf0 $ cat ens1f0_1/phys_port_name pf0vf1 And here regex is used. We really don't care about exact value of phys_port_name, 'cause there only one PF net device. And regex also cover smart nics, which use other phys_port_name scheme. So, WDYT?