[PATCH net-next 00/19] Mellanox, mlx5 sub function support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Dave, Jiri, Alex,

This series adds the support for mlx5 sub function devices using
mediated device with eswitch switchdev mode.

This series is currently on top of [1], but once [1] is merged to
net-next tree, this series should be resend through usual Saeed's
net-next tree or to netdev net-next.

Mdev alias support patches are already reviewed at vfio/kvm's mailing
list in past. Since mlx5_core driver is first user of the
mdev alias API, those patches are part of this series to merge via
net-next tree.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=net-next-mlx5

Also, once Jason Wang's series [9] is merged, mlx5_core driver will be enhanced to
use class id based matching support for mdev devices.

Abstract:
---------

Mellanox ConnectX device supports dynamic creation of sub function which
are equivalent to a PCI function handled by mlx5_core and mlx5_ib drivers.
It has few few differences to PCI function. They are described below in
overview section.

Mellanox sub function capability allows users to create several hundreds
of networking and/or rdma devices without depending on PCI SR-IOV support.

Motivation:
-----------
User wants to use multiple netdevices and rdma devices in a system
without enabling SR-IOV, possibly more number of devices than what
mellanox ConnectX device supports using SR-IOV.

Such devices will have most if not all of the acceleration and offload
capabilities of mlx5, while they are still light weight and don't demand
any dedicated physical resources such as (pci function, MSIX vectors).

Provision such netdevice and/or rdma device to use in bare-metal
server or provision to a container.

In future, map such sub function device to a VM once ConnectX device
supports it. In such case, it should be able to reuse existing
mediated device (mdev) [2] framework of kernel.

Regardless of how a sub function is used, it is desired to lifecycle
such device in unified way, such as using mdev [2].

User wants to have same level of control, visibility, offloads support
as that of current SR-IOV VFs using eswitch switchdev mode for
Ethernet devices.

Overview:
---------
Mellanox ConnectX sub functions are exposed to user as a mediated
device (mdev) [2] as discussed in RFC [3] and further during
netdevconf0x13 at [4].

mlx5 mediated device (mdev) enables users to create multiple netdevices
and/or RDMA devices from single PCI function.

Each mdev maps to a mlx5 sub function.
mlx5 sub function is similar to PCI VF. However it doesn't have its own
PCI function and MSI-X vectors.

mlx5 mdevs share common PCI resources such as PCI BAR region,
MSI-X interrupts.

Each mdev has its own window in the PCI BAR region, which is
accessible only to that mdev and applications using it.

Each mlx5 sub function has its own resource namespace for RDMA resources.

mdevs are supported when eswitch mode of the devlink instance
is in switchdev mode described in devlink documentation [5].

mdev is identified using a UUID defined by RFC 4122.

Each created mdev has unique 12 letters alias. This alias is used to
derive phys_port_name attribute of the corresponding representor
netdevice. This establishes clear link between mdev device, eswitch
port and eswitch representor netdevice.

systemd udev [6] will be enhanced post this work to have predictable
netdevice names based on the unique and predicable mdev alias of the
mdev device.

For example, an Ethernet netdevice of an mdev device having alias
'aliasfoo123' will be persistently named as enmaliasfoo123, where:
<en> for Ethernet
m = mediated device bus
<alias> = unique mdev alias

Design decisions:
-----------------
1. mdev device (and bus) instead of any new subdevice bus is used
after concluding discussion at [7]. This simplifies and eliminates the
need of new bus, also eliminates any vendor specific bus.

2. mdev device is also chosen to not create multiple tools for creating
netdevice, rdma device, devlink port, its representor netdevice and
representor rdma port.

3. mdev device support in eswitch switchdev mode, gives rich isolation
and reuses existing offload infrastructure of iproute2/tc.

4. mdev device also enjoys existing devlink health reporting infrastructure
uniformely for PF, VF and mdev.

5. Persistent naming of mdev's netdevice scheme is discussed and concluded
at [8] using systemwide unique, predicable mdev alias.

6. Unique eswitch port phys_port_name derivation from the mdev alias usage
is concluded at [8].

User commands examples:
-----------------------

- Set eswitch mode as switchdev mode
    $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

- Create a mdev
    Generate a UUID
    $ UUID=$(uuidgen)
    Create the mdev using UUID
    $ echo $UUID > /sys/class/net/ens2f0_p0/device/mdev_supported_types/mlx5_core-local/create

- Unbind a mdev from vfio_mdev driver
    $ echo $UUID > /sys/bus/mdev/drivers/vfio_mdev/unbind

- Bind a mdev to mlx5_core driver
    $ echo $UUID > /sys/bus/mdev/drivers/mlx5_core/bind

- View netdevice and (optionally) RDMA device in sysfs tree
    $ ls -l /sys/bus/mdev/devices/$UUID/net/
    $ ls -l /sys/bus/mdev/devices/$UUID/infiniband/

- View netdevice and (optionally) RDMA device using iproute2 tools
    $ ip link show
    $ rdma dev show

- Query maximum number of mdevs that can be created
    $ cat /sys/class/net/ens2f0_p0/device/mdev_supported_types/mlx5_core-local/max_mdevs

- Query remaining number of mdevs that can be created
    $ cat /sys/class/net/ens2f0_p0/device/mdev_supported_types/mlx5_core-local/available_instances

- Query an alias of the mdev
    $ cat /sys/bus/mdev/devices/$UUID/alias

Security model:
--------------
This section covers security aspects of mlx5 mediated devices at
host level and at network level.

Host side:
- At present mlx5 mdev is meant to be used only in a host.
It is not meant to be mapped to a VM or access by userspace application
using VFIO framework.
Hence, mlx5_core driver doesn't implement any of the VFIO device specific
callback routines.
Hence, mlx5 mediated device cannot be mapped to a VM or to a userspace
application via VFIO framework.

- At present an mlx5 mdev can be accessed by an application through
its netdevice and/or RDMA device.

- mlx5 mdev does not share PCI BAR with its parent PCI function.

- All mlx5 mdevs of a given parent device share a single PCI BAR.
However each mdev device has a small dedicated window of the PCI BAR.
Hence, one mdev device cannot access PCI BAR or any of the resources
of another mdev device.

- Each mlx5 mdev has its own dedicated event queue through which interrupt
notifications are delivered. Hence, one mlx5 mdev cannot enable/disable
interrupts of other mlx5 mdev. mlx5 mdev cannot enable/disable interrupts
of the parent PCI function.

Network side:
- By default the netdevice and the rdma device of mlx5 mdev cannot send or
receive any packets over the network or to any other mlx5 mdev.

- mlx5 mdev follows devlink eswitch and vport model of PCI SR-IOV PF and VFs.
All traffic is dropped by default in this eswitch model.

- Each mlx5 mdev has one eswitch vport representor netdevice and rdma port.
The user must do necessary configuration through such representor to enable
mlx5 mdev to send and/or receive packets.

Patchset summary:
-----------------
Patch 1 to 6 prepare mlx5 core driver to create mdev.

Patch-1 Moves mlx5 devlink port close to eswitch port
Patch-2 implements sub function vport representor handling
Patch-3,4 Introduces mlx5 sub function lifecycle routines
Patch-5 Extended mlx5 eswitch to enable/disable sub function vports
Patch-6 Registers mlx5 driver with mdev subsystem for mdev lifecycle

Patch-7 to 10 enhances mdev subsystem for unique alias generation

Patch-7 introduces sha1 based mdev alias
Patch-8 Ensures mdev alias is unique amlong all mdev devices
Patch-9 Exposes mdev alias in sysfs file for systemd/udev usage
Patch-10 Introduces mdev alias API to be used by vendor driver
Patch-11 Improves mdev to avoid nested locking with mlx5_core driver
and devlink subsystem

Patch-12 Introduces new devlink mdev port flavour
Patch-13 Registers devlink eswitch port for a sub function

Patch-14 to 17 implements mdev driver by reusing most of the mlx5_core
driver
Patch-14 implements sharing IRQ between sub function and parent PCI device
Patch-15 implements load, unload routine for sub function
Patch-16 implements dma ops for mediated device
Patch-17 implements mdev driver to bind to mdev device

Patch-18 Adds documentation for mdev overview, example and security model
Patch-19 extends sample mtty driver for mdev alias generation

[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=net-next-mlx5
[2] https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
[3] https://lkml.org/lkml/2019/3/8/819
[4] https://netdevconf.org/0x13/session.html?workshop-hardware-offload
[5] http://man7.org/linux/man-pages/man8/devlink-dev.8.html.
[6] https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-net_id.c
[7] https://lkml.org/lkml/2019/3/7/696
[8] https://lkml.org/lkml/2019/8/23/146
[9] https://patchwork.ozlabs.org/patch/1190425


Parav Pandit (14):
  net/mlx5: E-switch, Move devlink port close to eswitch port
  net/mlx5: Introduce SF life cycle APIs to allocate/free
  vfio/mdev: Introduce sha1 based mdev alias
  vfio/mdev: Make mdev alias unique among all mdevs
  vfio/mdev: Expose mdev alias in sysfs tree
  vfio/mdev: Introduce an API mdev_alias
  vfio/mdev: Improvise mdev life cycle and parent removal scheme
  devlink: Introduce mdev port flavour
  net/mlx5: Register SF devlink port
  net/mlx5: Add load/unload routines for SF driver binding
  net/mlx5: Implement dma ops and params for mediated device
  net/mlx5: Add mdev driver to bind to mdev devices
  Documentation: net: mlx5: Add mdev usage documentation
  mtty: Optionally support mtty alias

Vu Pham (4):
  net/mlx5: E-Switch, Add SF vport, vport-rep support
  net/mlx5: Introduce SF table framework
  net/mlx5: E-Switch, Enable/disable SF's vport during SF life cycle
  net/mlx5: Add support for mediated devices in switchdev mode

Yuval Avnery (1):
  net/mlx5: Share irqs between SFs and parent PCI device

 .../driver-api/vfio-mediated-device.rst       |   9 +
 .../device_drivers/mellanox/mlx5.rst          | 122 +++++
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   6 +
 drivers/net/ethernet/mellanox/mlx5/core/dev.c |  17 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  69 +++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   8 +
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  94 +---
 .../net/ethernet/mellanox/mlx5/core/en_rep.h  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   6 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  24 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  85 ++++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 147 +++++-
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |   3 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  80 +++-
 .../ethernet/mellanox/mlx5/core/meddev/mdev.c | 212 +++++++++
 .../mellanox/mlx5/core/meddev/mdev_driver.c   |  50 +++
 .../ethernet/mellanox/mlx5/core/meddev/sf.c   | 425 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/meddev/sf.h   |  73 +++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  54 +++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  12 +
 .../net/ethernet/mellanox/mlx5/core/vport.c   |   4 +-
 drivers/vfio/mdev/mdev_core.c                 | 198 +++++++-
 drivers/vfio/mdev/mdev_private.h              |   8 +-
 drivers/vfio/mdev/mdev_sysfs.c                |  26 +-
 include/linux/mdev.h                          |   5 +
 include/linux/mlx5/driver.h                   |   8 +-
 include/net/devlink.h                         |   9 +
 include/uapi/linux/devlink.h                  |   5 +
 net/core/devlink.c                            |  32 ++
 samples/vfio-mdev/mtty.c                      |  13 +
 32 files changed, 1678 insertions(+), 143 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/meddev/mdev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/meddev/mdev_driver.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/meddev/sf.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/meddev/sf.h

-- 
2.19.2




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux