On Mon, Sep 28, 2020 at 12:56:12PM +0300, Dmytro Linkin wrote: > On Tue, Sep 22, 2020 at 08:31:15AM -0400, Laine Stump wrote: > > On 8/28/20 6:53 AM, Dmytro Linkin wrote: > > >Current virPCIGetNetName() logic is to get net device name by checking > > >it's phys_port_id, if caller provide it, or by it's index (eg, by it's > > >position at sysfs net directory). This approach worked fine up until > > >linux kernel version 5.8, where NVIDIA Mellanox driver implemented > > >linking of VFs' representors to PCI device in switchdev mode. This mean > > >that device's sysfs net directory will hold multiple net devices. Ex.: > > > > > >$ ls '/sys/bus/pci/devices/0000:82:00.0/net' > > >ens1f0 eth0 eth1 > > > > > >Most switch devices support phys_port_name instead of phys_port_id, so > > >virPCIGetNetName() will try to get PF name by it's index - 0. The > > >problem here is that the PF nedev entry may not be the first. > > > > > >To fix that, for switch devices, we introduce a new logic to select the > > >PF uplink netdev according to the content of phys_port_name. Extend > > >virPCIGetNetName() with physPortNameRegex variable to get proper device > > >by it's phys_port_name scheme, for ex., "p[0-9]+$" to get PF, > > >"pf[0-9]+vf[0-9]+$" to get VF or "p1$" to get exact net device. So now > > >virPCIGetNetName() logic work in following sequence: > > > - filter by phys_port_id, if it's provided, > > > or > > > - filter by phys_port_name, if it's regex provided, > > > or > > > - get net device by it's index (position) in sysfs net directory. > > >Also, make getting content of iface sysfs files more generic. > > > > > >Signed-off-by: Dmytro Linkin <dlinkin@xxxxxxxxxx> > > >Reviewed-by: Adrian Chiris <adrianc@xxxxxxxxxx> > > > > > > [...] > > > > > > > > > >+/* Represents format of PF's phys_port_name in switchdev mode: > > >+ * 'p%u' or 'p%us%u'. New line checked since value is readed from sysfs file. > > >+ */ > > >+# define VIR_PF_PHYS_PORT_NAME_REGEX ((char *)"(p[0-9]+$)|(p[0-9]+s[0-9]+$)") > > >+ > > > > > > I've come back to look at this patch several times since it was > > posted (sorry for the extreme delay in responding), but just can't > > figure out what it's doing with this regex and why the regex is > > necessary. Not having access to the hardware that it works with is a > > bit of a problem, but perhaps I could get a better idea if you gave > > a full example of sysfs contents? My concern with using a regex is > > that it might work just fine when using one method for net device > > naming, but break if that was changed. Also, it seems > > counterintuitive that it would be necessary to look for a device > > with a name matching a specific pattern; why isn't there simply a > > single symbolic link somewhere in the sysfs tree for the net device > > that just directly points at its physical port? That would be so > > much simpler and more reliable (or at least it would give the > > *perception* of being more reliable). > > > > I'll emphasize that regex is being used for phys_port_name, NOT net > device name (they are not the same). Anyway, I'll give an example. > > Lets look how virPCIGetNetName() currently work. > At first it try to read phys_port_id of every net device in net > folder, like: > $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/' > $ cat net/ens1f0/phys_port_id > > Imagine we have single pci dual port nic (I hope named it right), > eg. net folder holds net devices for both ports, so read operation > will be successfull and function will return name of first or second > port. > For single port or dual pci nics (or for drivers which didn't > implemented phys_port_id) read fails and function fallback to > picking up device by it's index, which not really fine (I'll describe > it later) but work 'cause net folder usualy contains only one net > device. > > But kernel patch brought third case which not work. > > Here are ifaces of ConnectX-5 NIC: > $ ls -l /sys/class/net | cut -d ' ' -f 9- > ens1f0 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0 > ens1f1 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.1/net/ens1f1 > ens1f2 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.2/net/ens1f2 > ens1f3 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.3/net/ens1f3 > ens1f0_0 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_0 > ens1f0_1 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_1 > > ens1f0 & ens1f1 - PFs; > ens1f2 & ens1f3 - VFs created on 1st PF and.. > ens1f0_0 & ens1f0_1 - corresponding representors. > > Here is content of PFs' pci net folders (for comparison): > $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net' > ens1f0 ens1f0_0 ens1f0_1 > $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.1/net' > ens1f1 > > When virPCIGetNetName() called for 2nd PF, for ex., which is in legacy > mode, phys_port_id read fails (Operation not supported), function > falling back to index and return name of first and only net device. > Fine here. > When function called for 1st PF in switchdev mode, sequence is the same > and first net device name is returned. The problem here is that PF could > be not the first device, because order is not persistent and defined by > system. > > So approach is to find PF by reading phys_port_name. Ex.: > $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net/' > $ cat ens1f0/phys_port_name > p0 > $ cat ens1f0_0/phys_port_name > pf0vf0 > $ cat ens1f0_1/phys_port_name > pf0vf1 > > And here regex is used. We really don't care about exact value of > phys_port_name, 'cause there only one PF net device. And regex also > cover smart nics, which use other phys_port_name scheme. > > So, WDYT? Ping