Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote:
> > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
> > > >
> > > > > That said, it only works at the driver level. So if the firmware is
> > > > > the one that is having to do this it also occured to me that if this
> > > > > update happened on FLR that would probably be preferred.
> > > >
> > > > FLR is not free, I'd prefer not to require it just for some
> > > > philosophical reason.
> > > >
> > > > > Since the mlx5 already supports devlink I don't see any reason why the
> > > > > driver couldn't be extended to also support the devlink resource
> > > > > interface and apply it to interrupts.
> > > >
> > > > So you are OK with the PF changing the VF as long as it is devlink not
> > > > sysfs? Seems rather arbitary?
> > > >
> > > > Leon knows best, but if I recall devlink becomes wonky when the VF
> > > > driver doesn't provide a devlink instance. How does it do reload of a
> > > > VF then?
> > > >
> > > > I think you end up with essentially the same logic as presented here
> > > > with sysfs.
> > >
> > > The reasons why I decided to go with sysfs are:
> > > 1. This MSI-X table size change is applicable to ALL devices in the world,
> > > and not only netdev.
> >
> > In the PCI world MSI-X table size is a read only value. That is why I
> > am pushing back on this as a PCI interface.
>
> And it stays read-only.

Only if you come at it directly. What this is adding is a back door
that is visible as a part of the VF sysfs.

> >
> > > 2. This is purely PCI field and apply equally with same logic to all
> > > subsystems and not to netdev only.
> >
> > Again, calling this "purely PCI" is the sort of wording that has me
> > concerned. I would prefer it if we avoid that wording. There is much
> > more to this than just modifying the table size field. The firmware is
> > having to shift resources between devices and this potentially has an
> > effect on the entire part, not just one VF.
>
> It is internal to HW implementation, dumb device can solve it differently.

That is my point. I am worried about "dumb devices" that may follow. I
would like to see the steps that should be taken to prevent these sort
of things called out specifically. Basically this isn't just modifying
the PCIe config space, it is actually resizing the PBA and MSI-X
table.

> >
> > > 3. The sysfs interface is the standard way of configuring PCI/core, not
> > > devlink.
> >
> > This isn't PCI core that is being configured. It is the firmware for
> > the device. You are working with resources that are shared between
> > multiple functions.
>
> I'm ensuring that "lspci -vv .." will work correctly after such change.
> It is PCI core responsibility.

The current code doesn't work on anything with a driver loaded on it.
In addition the messaging provided is fairly minimal which results in
an interface that will be difficult to understand when it doesn't
work. In addition there is currently only one piece of hardware that
works with this interface which is the mlx5. My concern is this is
adding overhead to all VFs that will not be used by most SR-IOV
capable devices. In my view it would make much more sense to have a
top-down approach instead of bottom-up where the PF is registering
interfaces for the VFs.

If you want yet another compromise I would be much happier with the PF
registering the sysfs interfaces on the VFs rather than the VFs
registering the interface and hoping the PF supports it. At least with
that you are guaranteed the PF will respond to the interface when it
is registered.

> >
> > > 4. This is how orchestration software provisioning VFs already. It fits
> > > real world usage of SR-IOV, not the artificial one that is proposed during
> > > the discussion.
> >
> > What do you mean this is how they are doing it already? Do you have
> > something out-of-tree and that is why you are fighting to keep the
> > sysfs? If so that isn't a valid argument.
>
> I have Kubernetes and OpenStack, indeed they are not part of the kernel tree.
> They already use sriov_driver_autoprobe sysfs knob to disable autobind
> before even starting. They configure MACs and bind VFs through sysfs/netlink
> already. For them, the read/write of sysfs that is going to be bound to
> the already created VM with known CPU properties, fits perfectly.

By that argument the same could be said about netlink. What I don't
get is why it is okay to configure the MAC through netlink but
suddenly when we are talking about interrupts it is out of the
question. As far as the binding that is the driver interface which is
more or less grandfathered in anyway as there aren't too many ways to
deal with them as there isn't an alternate interface for the drivers
to define support.

> >
> > > So the idea to use devlink just because mlx5 supports it, sound really
> > > wrong to me. If it was other driver from another subsystem without
> > > devlink support, the request to use devlink won't never come.
> > >
> > > Thanks
> >
> > I am suggesting the devlink resources interface because it would be a
> > VERY good fit for something like this. By the definition of it:
> > ``devlink`` provides the ability for drivers to register resources, which
> > can allow administrators to see the device restrictions for a given
> > resource, as well as how much of the given resource is currently
> > in use. Additionally, these resources can optionally have configurable size.
> > This could enable the administrator to limit the number of resources that
> > are used.
>
> It is not resource, but HW objects. The devlink doesn't even see the VFs
> as long as they are not bound to the drivers.
>
> This is an example:
>
> [root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_drivers_autoprobe
> [root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
> [ 2370.579711] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3)
> [root@vm ~]# echo 2 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
> [ 2377.663666] mlx5_core 0000:01:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3)
> [ 2377.777010] pci 0000:01:00.1: [15b3:101c] type 00 class 0x020000
> [ 2377.784903] pci 0000:01:00.2: [15b3:101c] type 00 class 0x020000
> [root@vm ~]# devlink dev
> pci/0000:01:00.0
> [root@vm ~]# lspci |grep nox
> 01:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> 01:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
> 01:00.2 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
>
> So despite us having 2 VFs ready to be given to VMs, administrator doesn't
> see them as devices.

The MSI-X vectors are a resource assigned to hardware objects. It just
depends on how you want to look at things. Right now you have the VFs
register an interface on behalf of the PF. I am arguing it would be
better to have the PF register an interface on behalf of the VFs.
Ultimately the PF is responsible for creating the VFs in the first
place. I don't see it as that much of a leap to have the
mlx5_sriov_enable call register interfaces for the VFs so that you can
configure the MSI-X vectors from the PF, and then tear them down
before it frees the VFs. Having the VFs do the work seems error prone
since it is assuming the interfaces are there on the PF when in all
cases but one (mlx5) it currently isn't.

> >
> > Even looking over the example usage I don't see there being much to
> > prevent you from applying it to this issue. In addition it has the
> > idea of handling changes that cannot be immediately applied already
> > included. Your current solution doesn't have a good way of handling
> > that and instead just aborts with an error.
>
> Yes, because it is HW resource that should be applied immediately to
> make sure that it is honored, before it is committed to the users.

The problem is you cannot do that at all if the driver is already
loaded. One advantage of using something like devlink is that you
could potentially have the VF driver help to coordinate things so you
could have the case where the VF has the mlx5 driver loaded work
correctly where you would update the MSI-X vector count and then
trigger the driver reload via devlink.

> It is very tempting to use devlink everywhere, but it is really wrong
> tool for this scenario.

We can agree to disagree there. I am not a fan of sysfs being applied
everywhere either. The problem is it is an easy goto when someone is
looking for a quick and dirty solution and often leads to more
problems later as it usually misses critical path locking issues and
the like.

Especially when it is making a subordinate interface look like the
MSI-X table size is somehow writable. I would much rather the creation
of any interface controlled more directly by the PF, or at a minimum
have the PF registering the interfaces rather than leaving this up to
the VF in the hopes that the PF provides the functionality needed to
service the request.



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux