On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote:
> > IMHO, mdev has a mdev_parent_ops structure clearly intended to link it
> > to vfio, so using a mdev for something not related to vfio seems like
> > a poor choice.
>
> Yes, my suggestion to use mdev was entirely based on the premise that
> the purpose of this work is to get vfio working.. otherwise I'm unclear
> as to why we'd need a bus in the first place. If this is just for
> containers - we have macvlan offload for years now, with no need for a
> separate device.

This SF thing is a full-fledged VF function, it is not at all like
macvlan. This is perhaps less important for the netdev part of the
world, but the difference is very big for the RDMA side, and it should
enable VFIO too..

> On the RDMA/Intel front, would you mind explaining what the main
> motivation for the special buses is? I'm a little confurious.

Well, the issue is driver binding. For years we have had these
multi-function netdev drivers that have a single PCI device which must
bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb
drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I
expect when NVMe over TCP rolls out we will have drivers like cxgb4
binding to 6 subsystems in total!]

Today most of this is a big hack where the PCI device binds to the
netdev driver and then the other drivers in different subsystems
'discover' that an appropriate netdev is plugged in using various
unique, hacky and ugly means. For instance cxgb4 duplicates a chunk of
the device core, see cxgb4_register_uld() for example. Other drivers
try to use netdev notifiers, and various other wild things.

So, the general concept is to use the driver model to manage driver
binding. A multi-subsystem driver would have several parts:

- A pci_driver which binds to the pci_device (the core). It creates,
  on a bus, struct ??_device's for the other subsystems that this HW
  supports, ie if the chip supports netdev then a ??_device that binds
  to the netdev driver is created, same for RDMA
- A ??_driver in netdev binds to the device and accesses the core API
- A ??_driver in RDMA binds to the device and accesses the core API
- A ??_driver in SCSI binds to the device and accesses the core API

Now the driver model directly handles all binding, autoloading,
discovery, etc, and 'netdev' is just another consumer of 'core'
functionality.

For something like mlx5 the 'core' is the stuff in
drivers/net/ethernet/mellanox/mlx5/core/*.c, give or take. It is
broadly generic stuff like sending commands, creating queues, managing
HW resources, etc.

There has been some lack of clarity on what the ?? should be. People
have proposed platform and MFD, and those seem to be no-goes. So it
looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use
an ice_driver on an ice_bus, ditto for cxgb4, if I understand Greg's
guidance.

Though I'm wondering if we should have a 'multi_subsystem_device' that
was really just about passing a 'void *core_handle' from the 'core'
(ie the bus) to the driver (ie RDMA, netdev, etc). It seems weakly
defined, but also exactly what every driver doing this needs.. It is
basically what this series is abusing mdev to accomplish.
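To make that a bit more concrete, here is a minimal sketch of what such
a 'multi_subsystem_device' bus could look like. All of the names below
(msd_device, msd_driver, the core_handle plumbing) are made up for
illustration, this is not an existing API:

/* Sketch only: a tiny bus that hands a 'void *core_handle' from the
 * core pci_driver to per-subsystem drivers (RDMA, netdev, SCSI, ...). */
#include <linux/device.h>
#include <linux/slab.h>
#include <linux/string.h>

struct msd_device {
        struct device dev;
        const char *subsystem;     /* "rdma", "netdev", "scsi", ... */
        void *core_handle;         /* opaque handle into the core driver */
};

struct msd_driver {
        struct device_driver driver;
        const char *subsystem;
        int (*probe)(struct msd_device *mdev);
        void (*remove)(struct msd_device *mdev);
};

#define to_msd_device(d) container_of(d, struct msd_device, dev)
#define to_msd_driver(d) container_of(d, struct msd_driver, driver)

/* Bind a driver to a device when the subsystem names agree */
static int msd_match(struct device *dev, struct device_driver *drv)
{
        return !strcmp(to_msd_device(dev)->subsystem,
                       to_msd_driver(drv)->subsystem);
}

static int msd_probe(struct device *dev)
{
        return to_msd_driver(dev->driver)->probe(to_msd_device(dev));
}

static int msd_remove(struct device *dev)
{
        to_msd_driver(dev->driver)->remove(to_msd_device(dev));
        return 0;
}

static struct bus_type msd_bus = {
        .name   = "multi_subsystem",
        .match  = msd_match,
        .probe  = msd_probe,
        .remove = msd_remove,
};

static void msd_release(struct device *dev)
{
        kfree(to_msd_device(dev));
}

/* Called by the core pci_driver once per subsystem the HW supports;
 * the RDMA/netdev/SCSI side just registers a matching msd_driver and
 * gets probed with the core_handle. */
int msd_device_add(struct device *parent, const char *subsystem,
                   void *core_handle)
{
        struct msd_device *mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);

        if (!mdev)
                return -ENOMEM;
        mdev->subsystem = subsystem;
        mdev->core_handle = core_handle;
        mdev->dev.bus = &msd_bus;
        mdev->dev.parent = parent;
        mdev->dev.release = msd_release;
        dev_set_name(&mdev->dev, "%s.%s", dev_name(parent), subsystem);
        return device_register(&mdev->dev);
}

Error handling, the bus_register() at module init, module autoloading
aliases, etc are all skipped, this is only to show the shape of the
driver model usage, with sysfs bind/unbind coming for free.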
> My understanding is MFD was created to help with cases where single
> device has multiple pieces of common IP in it.

MFD really seems to be good at splitting a device when the HW is
orthogonal at the register level, ie you might have regs 100-200 for
ethernet and 200-300 for RDMA.

But this is not how modern HW works; the functional division is more
subtle and more software based, ie on most devices a netdev queue and
an rdma queue are nearly the same, just a few settings make them
function differently.

So what is needed isn't a split of the register set like MFD
specializes in, but a unique per-driver API between the 'core' and
'subsystem' parts of the multi-subsystem device.

> Do modern RDMA cards really share IP across generations?

What is a generation? Mellanox has had a stable RDMA driver across
many silicon generations. Intel looks like their new driver will
support at least the last two or more silicon generations..

RDMA drivers are monstrously complex things, there is a big incentive
not to respin them every time a new chip comes out.

> Is there a need to reload the drivers for the separate pieces (I
> wonder if the devlink reload doesn't belong to the device model :().

Yes, it is already done, but without driver model support the only way
to reload the rdma driver is to unload the entire module, as there is
no 'unbind'.

Jason