On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote:
> > Yes, my suggestion to use mdev was entirely based on the premise that
> > the purpose of this work is to get vfio working.. otherwise I'm unclear
> > as to why we'd need a bus in the first place. If this is just for
> > containers - we have macvlan offload for years now, with no need for a
> > separate device.
>
> This SF thing is a full fledged VF function, it is not at all like
> macvlan. This is perhaps less important for the netdev part of the
> world, but the difference is very big for the RDMA side, and should
> enable VFIO too..

Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF.
I'd perhaps need to learn more about RDMA to appreciate the difference.

> > On the RDMA/Intel front, would you mind explaining what the main
> > motivation for the special buses is? I'm a little confurious.
>
> Well, the issue is driver binding. For years we have had these
> multi-function netdev drivers that have a single PCI device which must
> bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb
> drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I
> expect when NVMe over TCP rolls out we will have drivers like cxgb4
> binding to 6 subsystems in total!]

What I'm missing is why is it so bad to have a driver register to
multiple subsystems.

I've seen no end of hacks caused by people trying to split their driver
too deeply by functionality. Separate sub-drivers, buses and modules.

The nfp driver was split up before I upstreamed it, I merged it into
one monolithic driver/module. Code is still split up cleanly internally,
the architecture doesn't change in any major way. Sure 5% of developers
were upset they can't do some partial reloads they were used to, but
they got used to the new ways, and 100% of users were happy about the
simplicity.

For the nfp I think the _real_ reason to have a bus was that it
was expected to have some out-of-tree modules bind to it. Something
I would not encourage :)

Maybe RDMA and storage have some requirements where the reload of the
part of the driver is important, IDK..

> > My understanding is MFD was created to help with cases where single
> > device has multiple pieces of common IP in it.
>
> MFD really seems to be good at splitting a device when the HW is
> orthogonal at the register level. Ie you might have regs 100-200 for
> ethernet and 200-300 for RDMA.
>
> But this is not how modern HW works, the functional division is more
> subtle and more software based. ie on most devices a netdev and rdma
> queue are nearly the same, just a few settings make them function
> differently.
>
> So what is needed isn't a split of register set like MFD specializes
> in, but a unique per-driver API between the 'core' and 'subsystem'
> parts of the multi-subsystem device.

Exactly, because the device is one. For my simplistic brain one device
means one driver, which can register to as many subsystems as it wants.

> > Do modern RDMA cards really share IP across generations?
>
> What is a generation? Mellanox has had a stable RDMA driver across
> many silicon generations. Intel looks like their new driver will
> support at least the last two or more silicon generations..
>
> RDMA drivers are monstrous complex things, there is a big incentive to
> not respin them every time a new chip comes out.

Ack, but then again none of the drivers gets rewritten from scratch,
right? It's not that some "sub-drivers" get reused and some not, no?

> > Is there a need to reload the drivers for the separate pieces (I
> > wonder if the devlink reload doesn't belong to the device model :().
>
> Yes, it is already done, but without driver model support the only way
> to reload the rdma driver is to unload the entire module as there is
> no 'unbind'

The reload is the only thing that I can think of (other than
out-of-tree code), but with devlink now I believe it can be solved
differently.

Thanks a lot for the explanation Jason, much appreciated!

The practicality of this is still a little elusive to me, but since
Greg seems on board I guess it's just me :)