Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

Sagi Grimberg <sagi@xxxxxxxxxxx> · Mon, 5 Feb 2018 14:16:03 +0200

Hi Roman and the team,

On 02/02/2018 04:08 PM, Roman Pen wrote:
This series introduces IBNBD/IBTRS modules.

IBTRS (InfiniBand Transport) is a reliable high speed transport library
which allows for establishing connection between client and server
machines via RDMA.

So it's not strictly InfiniBand, correct?

 It is optimized to transfer (read/write) IO blocks
in the sense that it follows the BIO semantics of providing the
possibility to either write data from a scatter-gather list to the
remote side or to request ("read") data transfer from the remote side
into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality.

A couple of questions on your multipath implementation:
1. What was your main objective over dm-multipath?
2. What was the consideration behind this implementation over
creating a stand-alone bio-based device node to reinject the
bio to the original block device?

IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access of a block device on
the server over IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally IBNBD uses IBTRS as an RDMA transport library.

Why?

    - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
      thus internal protocol is simple and consists of several request
      types only without awareness of underlying hardware devices.

Can you explain how the protocol is developed for thin-p? What is the
essence of how it's suited for it?

    - IBTRS was developed as an independent RDMA transport library, which
      supports fail-over and load-balancing policies using multipath, thus
      it can be used for any other IO needs rather than only for block
      device.

What do you mean by "any other IO"?

    - IBNBD/IBTRS is faster than NVME over RDMA.  Old comparison results:
      https://www.spinics.net/lists/linux-rdma/msg48799.html
      (I retested on latest 4.14 kernel - there is no significant
      difference, thus I post the old link).

That is interesting to learn.

Reading your reference brings up a couple of questions though:
- It's unclear to me how ibnbd performs reads without performing memory
  registration. Is it using the global dma rkey?

- It's unclear to me how there is a difference in noreg in writes,
  because for small writes nvme-rdma never registers memory (it uses
  inline data).

- Looks like with nvme-rdma you max out your iops at 1.6 MIOPs, which
  seems considerably low compared to other reports. Can you try and explain
  what the bottleneck was? This can be a potential bug and I (and the
  rest of the community) am interested in knowing more details.

- srp/scst comparison is really not fair having it in legacy request
  mode. Can you please repeat it and report a bug to either linux-rdma
  or to the scst mailing list?

- Your latency measurements are surprisingly high for a null target
  device (even for a low-end nvme device, actually) regardless of the
  transport implementation.

For example:
- QD=1 read latency is 648.95 for ibnbd (I assume usecs, right?) which is
  fairly high. On nvme-rdma it's 1058 us, which means over 1 millisecond,
  and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
  latency I got ~14 us. So something does not add up here. If this is
  not some configuration issue, then we have serious bugs to handle.

- At QD=16 the read latencies are > 10ms for null devices?! I'm having
  trouble understanding how you were able to get such high latencies
  (> 100 ms for QD>=100)

Can you share more information about your setup? It would really help
us understand more.
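
As a rough sanity check on the QD=1 read latencies quoted above, here is
a minimal sketch that converts mean latency into the throughput a single
submitter could sustain. It assumes Little's law (throughput = requests
in flight / mean latency) applies per submitting context; the helper
name iops_from_latency is made up for illustration, and the number of
jobs behind the linked results is not stated here, so these are per-queue
figures only, not totals.

```python
def iops_from_latency(queue_depth: int, latency_us: float) -> float:
    """Little's law: throughput = requests in flight / mean latency.

    Hypothetical helper for illustration; latency is in microseconds.
    """
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return queue_depth / (latency_us * 1e-6)


# QD=1 read latencies taken from the discussion above (microseconds).
qd1_latencies_us = [
    ("ibnbd", 648.95),
    ("nvme-rdma", 1058.0),
    ("srp", 1254.0),
    ("nvme-rdma, ~14 us case", 14.0),
]

for name, lat_us in qd1_latencies_us:
    iops = iops_from_latency(1, lat_us)
    print(f"{name:>24}: {iops:10.0f} IOPS per submitter at QD=1")
```

Under that assumption, a ~14 us round trip allows roughly 45 times more
IOPS per submitter at QD=1 than a ~650 us one, which is why the reported
latencies look inconsistent with a null target device.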

    - Major parts of the code were rewritten, simplified and overall code
      size was reduced by a quarter.

That is good to know.