>-----Original Message----- >From: Laine Stump <laine@xxxxxxxxxx> >Sent: Thursday, December 17, 2020 5:22 AM >To: libvir-list@xxxxxxxxxx >Cc: Dmytro Linkin <dlinkin@xxxxxxxxxx>; Moshe Levi <moshele@xxxxxxxxxx>; >Adrian Chiris <adrianc@xxxxxxxxxx> >Subject: Re: [PATCH] util: Add phys_port_name support on >virPCIGetNetName > >External email: Use caution opening links or attachments > > >On 12/10/20 11:51 AM, Adrian Chiris wrote: > >> Hi, >> Would be great to get a pair of eyes on this Patch, Thanks! > > >I've looked at it several times and every time would just end up shaking my >head wondering why there isn't one definitive symlink in the VF's sysfs for the >netdev of the physical port. (Part of my lack of response from the last time is >that I didn't know how to respond since I didn't understand why such >roundabout logic should be needed to answer such a basic question, decided I >should look at it again before responding, and then forgot about it :-() > > >Anyway, this time I'm determined to not get up until I've got it figured out (or >at least understand exactly what I don't have figured out)... > > > >>> -----Original Message----- >>> From: Dmytro Linkin <dlinkin@xxxxxxxxxx> >>> Sent: Tuesday, October 27, 2020 10:58 AM >>> To: libvir-list@xxxxxxxxxx >>> Cc: Laine Stump <laine@xxxxxxxxxx>; Adrian Chiris >>> <adrianc@xxxxxxxxxx>; Moshe Levi <moshele@xxxxxxxxxx> >>> Subject: Re: [PATCH] util: Add phys_port_name support on >>> virPCIGetNetName >>> >>> On Mon, Sep 28, 2020 at 12:56:12PM +0300, Dmytro Linkin wrote: >>>> On Tue, Sep 22, 2020 at 08:31:15AM -0400, Laine Stump wrote: >>>>> On 8/28/20 6:53 AM, Dmytro Linkin wrote: >>>>>> Current virPCIGetNetName() logic is to get net device name by >>>>>> checking it's phys_port_id, if caller provide it, or by it's index >>>>>> (eg, by it's position at sysfs net directory). This approach >>>>>> worked fine up until linux kernel version 5.8, where NVIDIA >>>>>> Mellanox driver implemented linking of VFs' representors to PCI >>>>>> device in switchdev mode. This mean that device's sysfs net >>>>>> directory will hold >>> multiple net devices. Ex.: >>>>>> $ ls '/sys/bus/pci/devices/0000:82:00.0/net' >>>>>> ens1f0 eth0 eth1 >>>>>> >>>>>> Most switch devices support phys_port_name instead of >>>>>> phys_port_id, so >>>>>> virPCIGetNetName() will try to get PF name by it's index - 0. The >>>>>> problem here is that the PF nedev entry may not be the first. >>>>>> >>>>>> To fix that, for switch devices, we introduce a new logic to >>>>>> select the PF uplink netdev according to the content of >phys_port_name. >>>>>> Extend >>>>>> virPCIGetNetName() with physPortNameRegex variable to get proper >>>>>> device by it's phys_port_name scheme, for ex., "p[0-9]+$" to get >>>>>> PF, "pf[0-9]+vf[0-9]+$" to get VF or "p1$" to get exact net device. >>>>>> So now >>>>>> virPCIGetNetName() logic work in following sequence: >>>>>> - filter by phys_port_id, if it's provided, >>>>>> or >>>>>> - filter by phys_port_name, if it's regex provided, >>>>>> or >>>>>> - get net device by it's index (position) in sysfs net directory. >>>>>> Also, make getting content of iface sysfs files more generic. >>>>>> >>>>>> Signed-off-by: Dmytro Linkin <dlinkin@xxxxxxxxxx> >>>>>> Reviewed-by: Adrian Chiris <adrianc@xxxxxxxxxx> >>>>> >>>>> [...] >>>>> >>>>> >>>>>> +/* Represents format of PF's phys_port_name in switchdev mode: >>>>>> + * 'p%u' or 'p%us%u'. New line checked since value is readed from >>>>>> +sysfs >>> file. >>>>>> + */ >>>>>> +# define VIR_PF_PHYS_PORT_NAME_REGEX ((char >>>>>> +*)"(p[0-9]+$)|(p[0-9]+s[0-9]+$)") >>>>>> + >>>>> >>>>> I've come back to look at this patch several times since it was >>>>> posted (sorry for the extreme delay in responding), but just can't >>>>> figure out what it's doing with this regex and why the regex is >>>>> necessary. Not having access to the hardware that it works with is >>>>> a bit of a problem, but perhaps I could get a better idea if you >>>>> gave a full example of sysfs contents? My concern with using a >>>>> regex is that it might work just fine when using one method for net >>>>> device naming, but break if that was changed. Also, it seems >>>>> counterintuitive that it would be necessary to look for a device >>>>> with a name matching a specific pattern; why isn't there simply a >>>>> single symbolic link somewhere in the sysfs tree for the net device >>>>> that just directly points at its physical port? That would be so >>>>> much simpler and more reliable (or at least it would give the >>>>> *perception* of being more reliable). >>>>> >>>> I'll emphasize that regex is being used for phys_port_name, NOT net >>>> device name (they are not the same). Anyway, I'll give an example. >>>> >>>> Lets look how virPCIGetNetName() currently work. >>>> At first it try to read phys_port_id of every net device in net >>>> folder, like: >>>> $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/' >>>> $ cat net/ens1f0/phys_port_id >>>> >>>> Imagine we have single pci dual port nic (I hope named it right), eg. >>>> net folder holds net devices for both ports, so read operation will >>>> be successfull and function will return name of first or second port. >>>> For single port or dual pci nics (or for drivers which didn't >>>> implemented phys_port_id) read fails and function fallback to >>>> picking up device by it's index, which not really fine (I'll >>>> describe it >>>> later) but work 'cause net folder usualy contains only one net device. >>>> >>>> But kernel patch brought third case which not work. >>>> >>>> Here are ifaces of ConnectX-5 NIC: >>>> $ ls -l /sys/class/net | cut -d ' ' -f 9- >>>> ens1f0 -> >>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0 >>>> ens1f1 -> >>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.1/net/ens1f1 >>>> ens1f2 -> >>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.2/net/ens1f2 >>>> ens1f3 -> >>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.3/net/ens1f3 >>>> ens1f0_0 -> >>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_0 >>>> ens1f0_1 -> >>>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_1 >>>> >>>> ens1f0 & ens1f1 - PFs; >>>> ens1f2 & ens1f3 - VFs created on 1st PF and.. >>>> ens1f0_0 & ens1f0_1 - corresponding representors. >>>> >>>> Here is content of PFs' pci net folders (for comparison): >>>> $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net' >>>> ens1f0 ens1f0_0 ens1f0_1 >>>> $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.1/net' >>>> ens1f1 >>>> >>>> When virPCIGetNetName() called for 2nd PF, for ex., which is in >>>> legacy mode, phys_port_id read fails (Operation not supported), >>>> function falling back to index and return name of first and only net device. >>>> Fine here. >>>> When function called for 1st PF in switchdev mode, sequence is the >>>> same and first net device name is returned. The problem here is that >>>> PF could be not the first device, because order is not persistent >>>> and defined by system. >>>> >>>> So approach is to find PF by reading phys_port_name. Ex.: >>>> $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net/' >>>> $ cat ens1f0/phys_port_name >>>> p0 >>>> $ cat ens1f0_0/phys_port_name >>>> pf0vf0 >>>> $ cat ens1f0_1/phys_port_name >>>> pf0vf1 > >>>> >>>> And here regex is used. We really don't care about exact value of >>>> phys_port_name, 'cause there only one PF net device. And regex also >>>> cover smart nics, which use other phys_port_name scheme. > > >Okay, in the last 4 hours I had typed *a lot* of stuff here, and thought I was >starting to understand how everything worked. But in the meantime I was >also exchanging email with Marcelo Leitner about the issue, and he happened >to mention that ConnectX-5 cards no longer have multiple PF netdevs on a >single PCI device. Indeed, single PF netdev per PCI physical function is there since ConnectX-4 However we still need to support ConnectX-3 NICs which follow the multiple PF netdev model (the "old model"). > > >I *had been* operating under the impression that multiple PFs on a single PCI >device was still the case, and that the entire reason for needing to look for the >proper PF netdev was to pick between one of multiple PFs on the single PCI >device. I had written up a whole huge wall of text explaining how the method >being used in this patch must be only partially complete, because it wasn't >matching the phys_port_name of the VF netdev to the phys_port_name of >the proper PF netdev. > > >But then after the email with Marcelo, his offhand statement slowly crept out >of my unconscious, prompted along by my realization that your regexp isn't >matching the devices that look to be different for different VFs, e.g. pf0vf1, it >is matching the plain "p0". After that, Marcelo provided me with access to a >machine with a ConnectX-5 card, where I confirmed for myself that there is >only a single PF on any PCI device, and that the "p0"-named device is the PF. >*Finally* it makes sense why your patch doesn't try to find a device that is >matched to the particular VF in any way, but just looks for the device that has >a fixed pattern for phys_port_name - it's because any PCI device will only >have a single PF netdev, and we just have to figure out which one that is. > > >Okay, so now that I finally understand, and have deleted the original wall of >text. > > >A few things about the patch: > > >1) the new utility function that you added looks very useful, but should be >added in a separate patch. > > >2) the patches don't apply cleanly any more. > > >3) since the regexp is fixed, I don't think it needs to be sent as an argument to >the function - it can just be referenced directly when needed. > > >4) I think it should be explained in the code that in the case of using >phys_port_name, there is only a single netdev on the PCI device that can >possibly be the PF, whereas in the case of using phys_port_id, there could be >multiple PFs, and so we have to match the phys_port_id of the VF (having >that bit of info would have saved me lots of agony :-) Makes sense, definitely we should add some more info as stated in 4. > > >If you like I can try splitting the existing patch and fixing the merge conflicts, >and put the results on gitlab or something for you to try out. I feel like I should >do *something*, after making you wait for so long :-/ > That would be great, we are on shutdown during next week, so will be able to look at it only the following week.