This series introduces the IBNBD/IBTRS kernel modules. IBNBD (InfiniBand network block device) allows RDMA transfer of block IO over an InfiniBand network. The driver presents itself as a block device on the client side and transmits the block requests in a zero-copy fashion to the server side via InfiniBand. The server part of the driver converts the incoming buffers back into BIOs and hands them down to the underlying block device. As soon as IO responses come back from the drive, they are transmitted back to the client.
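(Illustration only, not code from the patch set: a rough sketch of the server-side path described above, i.e. wrapping a received data buffer into a BIO and handing it to the underlying block device. The helper name ibnbd_srv_submit_write and its arguments are invented for this example, and the bio calls follow the pre-5.18 kernel signatures, which differ in newer kernels.)

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Hypothetical sketch: submit one page of received write data to the
 * backing device.  Real code would also set bio->bi_end_io so that the
 * IO response can be sent back to the client, as described above.
 */
static int ibnbd_srv_submit_write(struct block_device *bdev,
				  struct page *page, unsigned int len,
				  sector_t sector)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);	/* pre-5.18 signature */

	if (!bio)
		return -ENOMEM;

	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = REQ_OP_WRITE;

	if (bio_add_page(bio, page, len, 0) != len) {
		bio_put(bio);
		return -EIO;
	}

	submit_bio(bio);
	return 0;
}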
Hi Jack, Danil and Roman,

I met Danil and Roman last week at Vault, and I think you guys are awesome; thanks a lot for open-sourcing your work! However, I have a couple of issues here, some related to the code and some actually fundamental.

- Is there room for this ibnbd? If we were to take every block driver that was submitted without sufficient justification, it'd be very hard to maintain. What advantage (if any) does this buy anyone over existing rdma based protocols (srp, iser, nvmf)? I'm really (*really*) not sold on this one...

- To me, the fundamental design in which the client side owns a pool of buffers that it issues writes to seems inferior to the one taken in iser/nvmf (immediate data). IMO, the ibnbd design has scalability issues both in terms of server-side resources, client-side contention and network congestion (on infiniband the latter is less severe).

- I suggest that for your next post you provide a real-life use case where each of the existing drivers can't suffice, and by "can't suffice" I mean that it has a fundamental issue, not something that merely requires a fix. With that, our feedback can be much more concrete and (at least on my behalf) more open to accepting it.

- I'm not exactly sure why you would suggest that your implementation supports only infiniband if you use rdma_cm for address resolution, nor do I understand why you emphasize feature (2) below, nor why, even in the presence of rdma_cm, you have ibtrs_ib_path? (confused...) iWARP needs a bit more attention if you don't use the new generic interfaces though...

- I honestly do not understand why you need *19057* LOC to implement an rdma based block driver. That's nearly as large as all of our existing block drivers combined... A first glance at the code provides some explanations: (1) you have some strange code that has no business in a block driver, like ibtrs_malloc/ibtrs_zalloc (yikes), or open-coding various existing logging routines, (2) you are for some reason adding a second tag allocation scheme (why? see the short sbitmap sketch below), (3) you are open-coding a lot of stuff that we added to the stack in the past months... (4) you seem to over-layer your code for reasons that I do not really understand.

I didn't really look deeply into the code at all, just enough to get a feel for it, and it seems like it needs a lot of work before it can even be considered upstream ready.
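(Sketch for point (2) above, not code from the patch set: the tag allocator that blk-mq already uses lives in lib/sbitmap.c, so a driver can normally reuse it instead of rolling its own. The ibtrs_tags variable and the example_* helpers are invented names; the sbitmap_queue calls shown here match recent kernels, but details may vary between versions.)

#include <linux/sbitmap.h>
#include <linux/gfp.h>
#include <linux/numa.h>
#include <linux/smp.h>

static struct sbitmap_queue ibtrs_tags;	/* hypothetical per-session tag space */

static int example_init_tags(unsigned int depth)
{
	/* shift = -1 lets sbitmap pick a cacheline-friendly word size */
	return sbitmap_queue_init_node(&ibtrs_tags, depth, -1, false,
				       GFP_KERNEL, NUMA_NO_NODE);
}

static int example_get_tag(void)
{
	return __sbitmap_queue_get(&ibtrs_tags);	/* -1 if exhausted */
}

static void example_put_tag(unsigned int tag)
{
	sbitmap_queue_clear(&ibtrs_tags, tag, raw_smp_processor_id());
}

The point is simply that the allocate/free cycle (including fair wakeups under contention) is already covered by existing infrastructure.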
We designed and implemented this solution based on our needs in cloud computing. The key features are:
- High throughput and low latency due to:
  1) Only two rdma messages per IO
Where exactly did you witness latency that was meaningful because of having another rdma message on the wire? That's only for writes anyway, and we have first-burst data for that...
  2) Simplified client-side management of server memory
  3) Elimination of the SCSI sublayer
That's hardly an advantage given all we are losing without it...

Cheers,
Sagi.