On Fri, May 18, 2018 at 03:03:47PM +0200, Roman Pen wrote:
> Hi all,
>
> This is v2 of the series, which introduces the IBNBD/IBTRS modules.
>
> This cover letter is split into three parts:
>
> 1. Introduction, which almost repeats everything from previous cover
>    letters.
> 2. Changelog.
> 3. Performance measurements on linux-4.17.0-rc2 and on two different
>    Mellanox cards: ConnectX-2 and ConnectX-3, and CPUs: Intel and AMD.
>
>
> Introduction
>
> IBTRS (InfiniBand Transport) is a reliable high speed transport library
> which allows for establishing connections between client and server
> machines via RDMA. It is optimized to transfer (read/write) IO blocks
> in the sense that it follows the BIO semantics of providing the
> possibility to either write data from a scatter-gather list to the
> remote side or to request ("read") data transfer from the remote side
> into a given set of buffers.
>
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality, i.e. in IBTRS terminology an IBTRS path is a set of RDMA
> CMs, and a particular path is selected according to the load-balancing
> policy.
>
> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access of a block device on
> the server over the IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally IBNBD uses IBTRS as an RDMA transport library.
>
> Why?
>
> - IBNBD/IBTRS was developed in order to map thin provisioned volumes,
>   thus the internal protocol is simple.
> - IBTRS was developed as an independent RDMA transport library, which
>   supports fail-over and load-balancing policies using multipath, thus
>   it can be used for any other IO needs rather than only for block
>   devices.
> - IBNBD/IBTRS is faster than NVMe over RDMA.
>   Old comparison results:
>   https://www.spinics.net/lists/linux-rdma/msg48799.html
>   New comparison results: see the performance measurements section below.
>
> Key features of the IBTRS transport library and the IBNBD block device:
>
> o High throughput and low latency due to:
>    - Only two RDMA messages per IO.
>    - IMM InfiniBand messages on responses to reduce round trip latency.
>    - Simplified memory management: memory allocation happens once on
>      the server side when the IBTRS session is established.
>
> o IO fail-over and load-balancing by using multipath. According to
>   our test loads, an additional path brings ~20% more bandwidth.
>
> o Simple configuration of IBNBD:
>    - The server side is completely passive: volumes do not need to be
>      explicitly exported.
>    - Only the IB port GID and the device path are needed on the client
>      side to map a block device.
>    - A device is remapped automatically, e.g. after a storage reboot.
>
> Commits for the kernel can be found here:
> https://github.com/profitbricks/ibnbd/commits/linux-4.17-rc2
>
> The out-of-tree modules are here:
> https://github.com/profitbricks/ibnbd/
>
> Vault 2017 presentation:
> http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf

I think from the RDMA side, before we accept something like this, I'd
like to hear from Christoph, Chuck or Sagi that the dataplane
implementation of this is correct, e.g. that it uses the MRs properly
and invalidates at the right time, sequences with dma_ops as required,
etc. They have all done this work on their ULPs and it was tricky; I
don't want to see another ULP implement this wrong.
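To make the concern concrete, here is a rough sketch of the invalidation
ordering the existing ULPs converged on. This is not IBNBD/IBTRS code and
all the names (example_req, example_recv_done, etc.) are made up for
illustration; it only shows the general pattern: an IO must not be
completed back to the block layer until the MR covering its buffer has
been invalidated, either remotely via SEND_WITH_INV or locally via a
LOCAL_INV WR, and the SG list has been DMA unmapped.

#include <rdma/ib_verbs.h>
#include <linux/blk-mq.h>

struct example_req {			/* invented per-IO context */
	struct ib_cqe		cqe;
	struct ib_cqe		inv_cqe;
	struct ib_mr		*mr;
	struct ib_qp		*qp;
	struct ib_device	*dev;
	struct scatterlist	*sgl;
	int			nents;
	enum dma_data_direction	dir;
	struct request		*rq;
};

static void example_end_io(struct example_req *req)
{
	/* only now is it safe to hand the pages back to the block layer */
	ib_dma_unmap_sg(req->dev, req->sgl, req->nents, req->dir);
	blk_mq_end_request(req->rq, BLK_STS_OK);
}

static void example_inv_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct example_req *req = container_of(wc->wr_cqe,
					       struct example_req, inv_cqe);

	if (wc->status == IB_WC_SUCCESS)
		example_end_io(req);
}

static void example_recv_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct example_req *req = container_of(wc->wr_cqe,
					       struct example_req, cqe);
	struct ib_send_wr inv_wr = {}, *bad_wr;

	if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
		/* the peer invalidated the rkey for us, check it was ours */
		WARN_ON_ONCE(wc->ex.invalidate_rkey != req->mr->rkey);
		example_end_io(req);
		return;
	}

	/* no remote invalidate: post a LOCAL_INV and finish the IO only
	 * from its completion */
	inv_wr.opcode = IB_WR_LOCAL_INV;
	inv_wr.ex.invalidate_rkey = req->mr->rkey;
	inv_wr.send_flags = IB_SEND_SIGNALED;
	inv_wr.wr_cqe = &req->inv_cqe;
	req->inv_cqe.done = example_inv_done;

	if (ib_post_send(req->qp, &inv_wr, &bad_wr))
		example_end_io(req);	/* error path, simplified */
}

The point of the sketch is just the ordering: blk_mq_end_request() must
not run until the invalidate has completed and the buffers have been
unmapped, otherwise the peer can still DMA into pages the block layer has
already reused.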
I'm already skeptical here due to the performance numbers - they are not
really what I'd expect, and we may find that the invalidation changes
will bring the performance down further.

Jason