On Jul 24, 2015, at 12:26 PM, Jason Gunthorpe <jgunthorpe@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, Jul 24, 2015 at 10:36:07AM -0400, Chuck Lever wrote:
>
>> Unfinished, but operational:
>>
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfs-rdma-future
>
> Nice..
>
> Can you spend some time and reflect on how some of this could be
> lowered into the core code?

The point of the prototype is to start thinking about this with
actual data. :-) So I’m with you.

> The FMR and FRWR side have many
> similarities now..
>
>> FRWR is seeing a 10-15% throughput reduction with 8-thread dbench,
>> but a 5% improvement with 16-thread fio IOPS. 4K and 8K direct
>> read and write are negatively impacted.
>
> I'm not surprised since invalidate is sync. I believe you need to
> incorporate SEND WITH INVALIDATE to substantially recover this
> overhead.

I tried to find another kernel ULP using SEND WITH INVALIDATE, but I
didn’t see one. I assume you mean the NFS server would use this WR
when replying, to knock down the RPC’s client MRs remotely?

> It would be neat if the RQ could continue to advance while waiting for
> the invalidate.. That looks almost doable..

The new reply handling work queue is not restricted to serial reply
processing. Unlike the tasklet model, multiple RPC replies can be
processed at once, and they can run across all CPUs. The tasklet was
global, shared across all RPC/RDMA receive queues on that client.
AFAICT there is very little else that is shared between RPC replies.

I think using a work queue instead may be a tiny bit slower for each
RPC (perhaps due to additional context switching), but it will allow
much better scaling with the number of transports and mount points
the client creates.

I may not have understood your comment.

>> I converted the RPC reply handler tasklet to a work queue context
>> to allow sleeping.
>> A new .ro_unmap_sync method is invoked after
>> the RPC/RDMA header is parsed but before xprt_complete_rqst()
>> wakes up the waiting RPC.
>
> .. so the issue is the RPC must be substantially parsed to learn which
> MR it is associated with to schedule the invalidate?

Only the RPC/RDMA header has to be parsed, but yes. The needed
parsing is handled in rpcrdma_reply_handler right before the
.ro_unmap_sync call. Parsing of the RPC reply results is then done
by the upper layer once xprt_complete_rqst() has run.

>> This is actually much more efficient than the current logic,
>> which serially does an ib_unmap_fmr() for each MR the RPC owns.
>> So FMR overall performs better with this change.
>
> Interesting..
>
>> Because the next RPC cannot awaken until the last send completes,
>> send queue accounting is based on RPC/RDMA credit flow control.
>
> So for FRWR the sync invalidate effectively guarantees all SQEs
> related to this RPC are flushed. That seems reasonable; if the number
> of SQEs and CQEs are properly sized in relation to the RPC slot count
> it should be workable..

Yes, both queues are sized in rpcrdma_ep_create() according to the
slot count / credit limit.

> How does FMR and PHYS synchronize?

We still rely on timing there. The RPC's send buffer may be
re-allocated by the next RPC if that RPC wants to send a bigger
request than this one did. Thus there is still a tiny but non-zero
risk that the HCA may not be done with that send buffer. Closing
that hole is still on my to-do list.

>> I’m sure there are some details here that still need to be
>> addressed, but this fixes the big problem with FRWR send queue
>> accounting, which was that LOCAL_INV WRs would continue to
>> consume SQEs while another RPC was allowed to start.
>
> Did you test without that artificial limit you mentioned before?

Yes. No problems now; the limit is removed in the last patch in
that series.
> I'm also wondering about this:
>
>> During some other testing I found that when a completion upcall
>> returns to the provider leaving CQEs still on the completion queue,
>> there is a non-zero probability that a completion will be lost.
>
> What does lost mean?

Lost means a WC in the CQ is skipped by ib_poll_cq(). In other words,
I expected that during the next upcall, ib_poll_cq() would return WCs
that were not processed, starting with the last one on the CQ when my
upcall handler returned.

I found this by intentionally having the completion handler process
only one or two WCs and then return.

> The CQ is edge triggered, so if you don't drain it you might not get
> another timely CQ callback (which is bad), but CQEs themselves should
> not be lost.

I’m not sure I fully understand this problem; it might even be my
misunderstanding of how ib_poll_cq() works. But forcing the completion
upcall handler to completely drain the CQ during each upcall prevents
the issue.

(Note: I don’t think fixing this is a pre-requisite for the
synchronous invalidate work; it just happened to be in the patch
queue.)

--
Chuck Lever