On Mon, Jul 25, 2016 at 11:30:48AM -0500, Christoph Lameter wrote:

> > > We could easily do that following naming conventions for
> > > partitions or so. Why would doing so damage the API capabilities?
> > > Seems that they are sufficiently screwed up already. Cleaning that
> > > up could help quite a bit.
> >
> > The current API is problematic because we try to both be like netdev
> > in that all devices are accessible (rdma_cm) and at the same time
> > with individual per-device chardevs (uverbs0).
>
> Device? uverbs is not a device. A particular connectx3 connected to
> the pci bus is. And it should follow established naming conventions
> etc. Please let's drop the crap that is there now. If you use the
> notion of a device the way it is designed to then we would have fewer
> issues.

A 'device' in the RDMA sense has a specific set of properties related
specifically to RDMA, and the libibverbs API is designed around this
notion. We can't totally get rid of it; if we change the kernel
communication then 'device' has to be emulated somehow in userspace.

I don't know what you mean by 'device the way it is designed' - what
infrastructure are you talking about?

> > So, if you want to move fully to the per-char-dev model then I think
> > we'd give up the global netdev-like behaviors, things like
> > listen(0.0.0.0) and output route selection, and so forth. I doubt
> > there is any support for that.
>
> Can the official listen() syscall be made to work over infiniband
> devices? That would be best maybe?

I have no idea if this is feasible; AFAIK nobody is looking into that
option. If we did that, we would get an fd out of the standard scheme -
what is that fd? How does it get linked back to the actual driver ioctl
interface?

> I think in general one does the connection initiation via TCP and IP
> protocol regardless... So really infiniband does only matter as the
> underlying protocol over which we have imposed IP semantics via IPoIB.
No, it is mostly all done over native protocols, not IP. IP is just
used for ARP and the routing table.

> > If we go the other way to a full netdev-like module then we give up
> > fine-grained (currently mildly broken) file system permissions.
>
> Maybe go with a device semantic and not with full netdev because this
> is not a classic packet-based network.

Well, what do you mean? We are emulating netdev and strongly linking
ipoib netdev devices to the RDMA infrastructure - that is the mismatch
I keep talking about.

> > You haven't explained how we can mesh the rdma_cm, netdev-like
> > listen(0.0.0.0) type semantics, continue to implement multi-port APM
> > functionality, share PDs across ports, etc, etc. These are all the
> > actual things done today that break when we drop the multiplexors.
>
> I am not *the* expert on this. Frankly this whole RDMA request stuff
> is not that interesting. The basic thing that the RDMA API needs to do
> for my use case is fast messaging bypassing the kernel. And having a
> gazillion special ioctls on the side is not that productive. Can we
> please reuse the standard system calls and ioctls as much as possible?

I know it is not interesting for you, but this is the majority use
model for RDMA, so it has to be the main purpose for the design. Your
DPDK-like use case really has little need for most of the RDMA API.

> No idea what you mean by multiport "APMs". There is an obvious way to
> aggregate devices by creating a new one like done in the storage
> subsystem.

APM (Automatic Path Migration) is a special RDMA feature for fast
failover and recovery. It has dedicated hardware support and, at least
with our current API, cannot be modeled with device stacking. It is
functionally different from the teaming/bonding we see in ethernet.

> Sharing PDs? Those are from the same address space using multiple
> devices.
> It would be natural to share that in such a case since they are more
> bound to the memory layout of a single process and not so much to the
> devices. So PDs could be per process instead of per device.

I generally agree that we got this upside down; devices/ports should
have been created under PDs. But the fact remains that the hardware we
have is very restricted in how PD resources can be shared between
ports. Today only ports on the same RDMA device can share a PD. That is
why we have the upside-down model.

So when I talk about sharing a PD, I mean in the sense that an app
cannot just create a single PD and use that for all RDMA. It currently
needs a PD per RDMA device. I don't know if anyone thinks this is a
pain point.

> Yes please simplify this sprawl as much as possible. Follow standard
> convention instead of reinventing things like device aggregation.

I can't help that IBTA created a different scheme for device
aggregation, addressing, and basically everything else. 'Standard
convention' in the kernel really just means ethernet, and this isn't
like ethernet, except superficially.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html