On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote: > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote: > > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote: > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote: > > > > > > > That said, it only works at the driver level. So if the firmware is > > > > the one that is having to do this it also occured to me that if this > > > > update happened on FLR that would probably be preferred. > > > > > > FLR is not free, I'd prefer not to require it just for some > > > philosophical reason. > > > > > > > Since the mlx5 already supports devlink I don't see any reason why the > > > > driver couldn't be extended to also support the devlink resource > > > > interface and apply it to interrupts. > > > > > > So you are OK with the PF changing the VF as long as it is devlink not > > > sysfs? Seems rather arbitary? > > > > > > Leon knows best, but if I recall devlink becomes wonky when the VF > > > driver doesn't provide a devlink instance. How does it do reload of a > > > VF then? > > > > > > I think you end up with essentially the same logic as presented here > > > with sysfs. > > > > The reasons why I decided to go with sysfs are: > > 1. This MSI-X table size change is applicable to ALL devices in the world, > > and not only netdev. > > In the PCI world MSI-X table size is a read only value. That is why I > am pushing back on this as a PCI interface. And it stays read-only. > > > 2. This is purely PCI field and apply equally with same logic to all > > subsystems and not to netdev only. > > Again, calling this "purely PCI" is the sort of wording that has me > concerned. I would prefer it if we avoid that wording. There is much > more to this than just modifying the table size field. The firmware is > having to shift resources between devices and this potentially has an > effect on the entire part, not just one VF. It is internal to HW implementation, dumb device can solve it differently. > > > 3. The sysfs interface is the standard way of configuring PCI/core, not > > devlink. > > This isn't PCI core that is being configured. It is the firmware for > the device. You are working with resources that are shared between > multiple functions. I'm ensuring that "lspci -vv .." will work correctly after such change. It is PCI core responsibility. > > > 4. This is how orchestration software provisioning VFs already. It fits > > real world usage of SR-IOV, not the artificial one that is proposed during > > the discussion. > > What do you mean this is how they are doing it already? Do you have > something out-of-tree and that is why you are fighting to keep the > sysfs? If so that isn't a valid argument. I have Kubernetes and OpenStack, indeed they are not part of the kernel tree. They already use sriov_driver_autoprobe sysfs knob to disable autobind before even starting. They configure MACs and bind VFs through sysfs/netlink already. For them, the read/write of sysfs that is going to be bound to the already created VM with known CPU properties, fits perfectly. > > > So the idea to use devlink just because mlx5 supports it, sound really > > wrong to me. If it was other driver from another subsystem without > > devlink support, the request to use devlink won't never come. > > > > Thanks > > I am suggesting the devlink resources interface because it would be a > VERY good fit for something like this. By the definition of it: > ``devlink`` provides the ability for drivers to register resources, which > can allow administrators to see the device restrictions for a given > resource, as well as how much of the given resource is currently > in use. Additionally, these resources can optionally have configurable size. > This could enable the administrator to limit the number of resources that > are used. It is not resource, but HW objects. The devlink doesn't even see the VFs as long as they are not bound to the drivers. This is an example: [root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_drivers_autoprobe [root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs [ 2370.579711] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3) [root@vm ~]# echo 2 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs [ 2377.663666] mlx5_core 0000:01:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3) [ 2377.777010] pci 0000:01:00.1: [15b3:101c] type 00 class 0x020000 [ 2377.784903] pci 0000:01:00.2: [15b3:101c] type 00 class 0x020000 [root@vm ~]# devlink dev pci/0000:01:00.0 [root@vm ~]# lspci |grep nox 01:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6] 01:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] 01:00.2 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] So despite us having 2 VFs ready to be given to VMs, administrator doesn't see them as devices. > > Even looking over the example usage I don't see there being much to > prevent you from applying it to this issue. In addition it has the > idea of handling changes that cannot be immediately applied already > included. Your current solution doesn't have a good way of handling > that and instead just aborts with an error. Yes, because it is HW resource that should be applied immediately to make sure that it is honored, before it is committed to the users. It is very tempting to use devlink everywhere, but it is really wrong tool for this scenario. Thanks