On 12/14/20 10:48 PM, Parav Pandit wrote: > >> From: Alexander Duyck <alexander.duyck@xxxxxxxxx> >> Sent: Tuesday, December 15, 2020 7:24 AM >> >> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@xxxxxxxxxx> >> wrote: >>> >>> Hi Dave, Jakub, Jason, >>> >> >> Just to clarify a few things for myself. You mention virtualization and SR-IOV >> in your patch description but you cannot support direct assignment with this >> correct? > Correct. it cannot be directly assigned. > >> The idea here is simply logical partitioning of an existing network >> interface, correct? > No. Idea is to spawn multiple functions from a single PCI device. > These functions are not born in PCI device and in OS until they are created by user. > Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3]. > I better not repeat all of it here again. Please go through it. > If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail. > >> So this isn't so much a solution for virtualization, but may >> work better for containers. I view this as an important distinction to make as >> the first thing that came to mind when I read this was mediated devices >> which is similar, but focused only on the virtualization case: >> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated- >> device.html >> > Managing subfunction using medicated device is already ruled out last year at [5] as it is the abuse of the mdev bus for this purpose + has severe limitations of managing the subfunction device. > We are not going back to it anymore. > It will be duplicating lot of the plumbing which exists in devlink, netlink, auxiliary bus and more. > >> Rather than calling this a subfunction, would it make more sense to call it >> something such as a queue set? > No, queue is just one way to send and receive data/packets. > Jason and Saeed explained and discussed this piece to you and others during v0 few weeks back at [1], [2], [3]. > Please take a look. > >> So in terms of ways to go I would argue this is likely better. However one >> downside is that we are going to end up seeing each subfunction being >> different from driver to driver and vendor to vendor which I would argue >> was also one of the problems with SR-IOV as you end up with a bit of vendor >> lock-in as a result of this feature since each vendor will be providing a >> different interface. >> > Each and several vendors provided unified interface for managing VFs. i.e. > (a) enable/disable was via vendor neutral sysfs > (b) sriov capability exposed via standard pci capability and sysfs > (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) are using vendor agnostic netlink > Even though the driver's internal implementation largely differs on how trust, spoof, mac, vlan rate etc are enforced. > > So subfunction feature/attribute/functionality will be implemented differently internally in the driver matching vendor's device, for reasonably abstract concept of 'subfunction'. > >>> A Subfunction supports eswitch representation through which it >>> supports tc offloads. User must configure eswitch to send/receive >>> packets from/to subfunction port. >>> >>> Subfunctions share PCI level resources such as PCI MSI-X IRQs with >>> their other subfunctions and/or with its parent PCI function. >> >> This piece to the architecture for this has me somewhat concerned. If all your >> resources are shared and > All resources are not shared. > >> you are allowing devices to be created >> incrementally you either have to pre-partition the entire function which >> usually results in limited resources for your base setup, or free resources >> from existing interfaces and redistribute them as things change. I would be >> curious which approach you are taking here? So for example if you hit a >> certain threshold will you need to reset the port and rebalance the IRQs >> between the various functions? > No. Its works bit differently for mlx5 device. > When base function is started, it started as if it doesn't have any subfunctions. > When subfunction is instantiated, it spawns new resources in device (hw, fw, memory) depending on how much a function wants. > > For example, PCI PF uses BAR 0, while subfunctions uses BAR 2. > For IRQs, subfunction instance shares the IRQ with its parent/hosting PCI PF. > In future, yes, a dedicated IRQs per SF is likely desired. > Sridhar also talked about limiting number of queues to a subfunction. > I believe there will be resources/attributes of the function to be controlled. > devlink already provides rich interface to achieve that using devlink resources [8]. > > [..] > >>> $ ip link show >>> 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state >> DOWN mode DEFAULT group default qlen 1000 >>> link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff >>> altname enp6s0f0np0 >>> 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN >> mode DEFAULT group default qlen 1000 >>> link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff> >> >> I assume that p0sf88 is supposed to be the newly created subfunction. >> However I thought the naming was supposed to be the same as what you are >> referring to in the devlink, or did I miss something? >> > I believe you are confused with the representor netdevice of subfuction with devices of subfunction. (netdev, rdma, vdpa etc). > I suggest that please refer to the diagram in patch_15 in [7] to see the stack, modules, objects. > Hope below description clarifies a bit. > There are two netdevices. > (a) representor netdevice, attached to the devlink port of the eswitch > (b) netdevice of the SF used by the end application (in your example, this is assigned to container). > > Both netdevice follow obviously a different naming scheme. > Representor netdevice follows naming scheme well defined in kernel + systemd/udev v245 and higher. > It is based on phys_port_name sysfs attribute. > This is same for existing PF and SF representors exist for year+ now. Further used by subfunction. > > For subfunction netdevice (p0s88), system/udev will be extended. I put example based on my few lines of udev rule that reads > phys_port_name and user supplied sfnum, so that user exactly knows which interface to assign to container. > >>> After use inactivate the function: >>> $ devlink port function set ens2f0npf0sf88 state inactive >>> >>> Now delete the subfunction port: >>> $ devlink port del ens2f0npf0sf88 >> >> This seems wrong to me as it breaks the symmetry with the port add >> command and > Example of the representor device is only to make life easier for the user. > Devlink port del command works based on the devlink port index, just like existing devlink port commands (get,set,split,unsplit). > I explained this in a thread with Sridhar at [6]. > In short devlink port del <bus/device_name/port_index command is just fine. > Port index is unique handle for the devlink instance that user refers to delete, get, set port and port function attributes post its creation. > I choose the representor netdev example because it is more intuitive to related to, but port index is equally fine and supported. > >> assumes you have ownership of the interface in the host. I >> would much prefer to to see the same arguments that were passed to the >> add command being used to do the teardown as that would allow for the >> parent function to create the object, assign it to a container namespace, and >> not need to pull it back in order to destroy it. > Parent function will not have same netdevice name as that of representor netdevice, because both devices exist in single system for large part of the use cases. > So port delete command works on the port index. > Host doesn't need to pull it back to destroy it. It is destroyed via port del command. > > [1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@xxxxxxxxxx/ > [2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@xxxxxxxxxx/ > [3] https://lore.kernel.org/netdev/20201120161659.GE917484@xxxxxxxxxx/ > [4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/ > [5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@xxxxxxxxxxxx/ > [6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > [7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@xxxxxxxxxx/T/#u > [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html > Seems to be a repeated line of questions. You might want to add these FAQs, responses and references to the subfunction document once this set gets merged.