Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

On Wed, 2018-02-07 at 13:57 +0100, Roman Penyaev wrote:
> On Tue, Feb 6, 2018 at 5:01 PM, Bart Van Assche <Bart.VanAssche@xxxxxxx> wrote:
> > On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
> > Something else I would like to understand better is how much of the latency
> > gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
> > protocol. Was e.g. support for immediate data present in the NVMeOF and/or
> > SRP drivers used on your test setup?
> 
> I did not get the question. IBTRS responds to I/O with empty messages in
> which only the imm_data field is set; that is part of the IBTRS protocol.
> I do not understand how immediate data can be present in other drivers if
> they do not use it in their protocols. I am lost here.

With "immediate data" I was referring to including the entire write buffer
in the write PDU itself. See e.g. the enable_imm_data kernel module parameter
of the ib_srp-backport driver. See also the use of SRP_DATA_DESC_IMM in the
SCST ib_srpt target driver. Neither the upstream SRP initiator nor the upstream
SRP target supports immediate data today. However, sending that code upstream
is on my to-do list.
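To make the concept concrete: with immediate data the initiator places the
write payload directly behind the command in the same send, so the target can
start processing the write without first issuing an RDMA READ to fetch the
data. A rough sketch of such a message layout (illustrative only, not the
actual SRP or SCST wire format; the struct and field names are made up):

	struct imm_write_msg {
		struct cmd_hdr	hdr;		/* the usual command header       */
		__le32		imm_len;	/* length of the payload below    */
		u8		imm_data[];	/* write payload, carried in-line */
	};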

For the upstream NVMeOF initiator and target drivers, see also the call of
nvme_rdma_map_sg_inline() in nvme_rdma_map_data().
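The decision the initiator makes there is roughly the following (a simplified
paraphrase of the logic in nvme_rdma_map_data(), not the exact upstream code):

	/*
	 * Small single-segment writes can be sent in-line with the command
	 * capsule; everything else needs memory registration.
	 */
	if (count == 1 && rq_data_dir(rq) == WRITE &&
	    blk_rq_payload_bytes(rq) <= nvme_rdma_inline_data_size(queue))
		return nvme_rdma_map_sg_inline(queue, req, c);

	return nvme_rdma_map_sg_fr(queue, req, c, count);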

> > Are you aware that the NVMeOF target driver calls page_alloc() from the
> > hot path, but that there are plans to avoid these calls in the hot path by
> > using a caching mechanism similar to the SGV cache in SCST? Are you aware
> > that a significant latency reduction can be achieved by changing the SCST
> > SGV cache from a global into a per-CPU cache?
> 
> No, I was not aware. It is nice that there is a lot of room for performance
> tweaks. I will definitely retest on a fresh kernel once everything is done in
> nvme, scst or ibtrs (especially once we get rid of FMRs and UNSAFE rkeys).

The functions sgl_alloc() and sgl_free() were recently introduced in the upstream
kernel (they will be included in kernel v4.16). The NVMe target driver, LIO and
several other drivers have been modified to use these functions instead of their
own copies of that code. The next step is to replace these calls with calls to
functions that perform cached allocations.
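For reference, these helpers live in lib/scatterlist.c (behind CONFIG_SGL_ALLOC)
and are used roughly as follows. alloc_cmd_buf() and free_cmd_buf() are made-up
wrapper names for the sake of the example; only sgl_alloc() and sgl_free() are
the real API:

	#include <linux/scatterlist.h>

	/* Allocate an sg list plus backing pages covering 'len' bytes. */
	static struct scatterlist *alloc_cmd_buf(unsigned long long len,
						 unsigned int *nents)
	{
		return sgl_alloc(len, GFP_KERNEL, nents);
	}

	static void free_cmd_buf(struct scatterlist *sgl)
	{
		sgl_free(sgl);	/* frees both the pages and the sg entries */
	}

A cached (e.g. per-CPU) allocator could later be slotted in behind the same two
calls without having to touch the drivers again.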

> > Regarding the SRP measurements: have you tried to set the
> > never_register kernel module parameter to true? I'm asking this because I
> > think that mode is most similar to how the IBNBD initiator driver works.
> 
> Yes, according to my notes from that link (frankly, I do not remember, but
> that is what I wrote a year ago):
> 
>     * Where the suffixes mean:
> 
>      _noreg - modules on the initiator side (ib_srp, nvme_rdma) were loaded
>               with the 'register_always=N' parameter
> 
> Is that what you are asking?

Not really. With register_always=Y memory registration is always used by the
SRP initiator, even if the data can be coalesced into a single sg entry. With
register_always=N memory registration is only performed if multiple sg entries
are needed to describe the data. And with never_register=Y memory registration
is not used even if multiple sg entries are needed to describe the data buffer.
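Put differently, the initiator-side decision boils down to something like this
(illustrative pseudo-code only, not the actual ib_srp source):

	if (never_register)
		register_memory = false;		/* never register   */
	else if (register_always)
		register_memory = true;			/* always register  */
	else
		register_memory = (nr_sg_entries > 1);	/* only when one sg
							   entry is not enough */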

Thanks,

Bart.
