Re: NFSD generic R/W API (sendto path) performance results

> On Nov 17, 2016, at 3:20 PM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
> 
> 
>>>> Also did you try to always register for > max_sge
>>>> calls?  The code can already register all segments with the
>>>> rdma_rw_force_mr module option, so it would only need a small tweak for
>>>> that behavior.
>>> 
>>> For various reasons I decided the design should build one WR chain for
>>> each RDMA segment provided by the client. Good clients expose just
>>> one RDMA segment for the whole NFS READ payload.
>>> 
>>> Does force_mr make the generic API use FRWR with RDMA Write? I had
>>> assumed it changed only the behavior with RDMA Read. I'll try that
>>> too, if RDMA Write can easily be made to use FRWR.
>> 
>> Unfortunately, some RPC replies are formed from two or three
>> discontiguous buffers. The gap test in ib_sg_to_pages returns
>> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
>> fails.
>> 
>> Thus with my current prototype I'm not able to test with FRWR.
>> 
>> I could fix this in my prototype, but it would be nicer for me if
>> rdma_rw_init_ctx handled this case the same for FRWR as it does
>> for physical addressing, which doesn't seem to have any problem
>> with a discontiguous SGL.
>> 
>> 
>>> But I'd like a better explanation for this result. Could be a bug
>>> in my implementation, my design, or in the driver. Continuing to
>>> investigate.
> 
> Hi Chuck, sorry for the late reply (have been busy lately..)
> 
> I think the Call-to-first-Write phenomenon you are seeing makes
> perfect sense. The question is: is QD=1 1MB transfer latency that
> interesting? Did you see a positive effect on small (say 4k) transfers?
> Both latency and IOPS scalability should be able to improve, especially
> when serving multiple clients.
> 
> If indeed you feel that this is an interesting workload to optimize, I
> think we can come up with something.

I believe 1MB transfers are interesting: NFS is frequently used in
backup scenarios, for example, and, believe it or not, also for
non-linear editing and animation (4K video).

QD=1 exposes the individual components of latency. In this case it
clearly shows the cost of preparing the data payload for transfer.
It's basically a diagnostic setting that lets us isolate and debug
the problem.

In the real world I expect to see larger workloads, where several
1MB I/Os are dispatched in parallel. Currently I don't reach fabric
bandwidth until 10 or more are in flight, and I think that can be
improved.
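
For illustration only (this is not necessarily my actual harness,
and the filename is hypothetical), an fio job over the NFS mount
along these lines exercises both cases:

# QD=1 isolates per-I/O latency; bump iodepth to 10 or more to find
# where the fabric saturates.
[nfs-read-1m]
filename=/mnt/nfs/testfile
rw=read
bs=1m
direct=1
ioengine=libaio
size=8g
iodepth=1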

Wrt 4KB, I didn't see much change there, though I admit I didn't
expect much: both cases have to DMA-map a page and post one Write
WR, so I haven't looked too closely. Even without the server changes
I'm already down around 32us for a 4KB NFS READ, and 30us for a 4KB
NFS READ handled all-inline.

I agree, though, that there is a 3:1 reduction in ib_post_send calls
with the generic API in my test harness, and that can only be a good
thing.
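
For anyone following along, the server-side flow looks roughly like
this. This is a sketch only: the svcrdma plumbing around it is
elided, the helper name and arguments are made up for illustration,
and the completion handler (which would destroy the ctx on success)
is omitted.

#include <rdma/rw.h>

/*
 * Sketch: push one Write chunk segment to the client with the
 * generic R/W API. One rdma_rw_ctx per client-provided RDMA
 * segment; the whole Write chain goes down in a single
 * ib_post_send via rdma_rw_ctx_post().
 */
static int svc_rdma_send_segment_sketch(struct ib_qp *qp, u8 port_num,
					struct scatterlist *sgl, u32 sg_cnt,
					u64 seg_offset, u32 seg_rkey,
					struct ib_cqe *write_cqe)
{
	struct rdma_rw_ctx ctx;
	int ret;

	/* Builds the RDMA Write WRs (or FRWR + Write when force_mr
	 * is set) covering this segment's payload pages. */
	ret = rdma_rw_ctx_init(&ctx, qp, port_num, sgl, sg_cnt, 0,
			       seg_offset, seg_rkey, DMA_TO_DEVICE);
	if (ret < 0)
		return ret;

	/* One ib_post_send for the whole chain. */
	ret = rdma_rw_ctx_post(&ctx, qp, port_num, write_cqe, NULL);
	if (ret)
		rdma_rw_ctx_destroy(&ctx, qp, port_num, sgl, sg_cnt,
				    DMA_TO_DEVICE);
	return ret;
}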


> About the First-to-last-Write, that's weird, and sounds like a bug
> somewhere. Maybe the Mellanox folks can tell us whether splitting 1MB
> into multiple Writes works better (although I cannot comprehend why
> it would).
> 
> Question, are the send and receive cqs still in IB_POLL_SOFTIRQ mode?

Yes, both are still in IB_POLL_SOFTIRQ. A while back I experimented
with IB_POLL_WORKQUEUE and noticed some slowdown when the Send CQ
was switched over, so I didn't explore it further.

Someday these will need to change so that the server runs entirely
in process context (as it does for TCP, AFAICT). That gets rid of
the BH flapping, but adds heavy-weight context-switch latency.
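
For reference, the poll context is picked when the CQ is allocated,
so switching is a one-line change at accept time. A minimal sketch
(the helper name is made up for illustration):

#include <rdma/ib_verbs.h>

/*
 * Sketch: completions run in softirq today; IB_POLL_WORKQUEUE would
 * move them into process context at the cost of a wakeup and a
 * context switch per completion batch.
 */
static struct ib_cq *svc_rdma_alloc_send_cq_sketch(struct ib_device *dev,
						   void *priv, int depth,
						   bool use_workqueue)
{
	enum ib_poll_context poll_ctx = use_workqueue ?
			IB_POLL_WORKQUEUE : IB_POLL_SOFTIRQ;

	return ib_alloc_cq(dev, priv, depth, 0 /* comp_vector */, poll_ctx);
}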


--
Chuck Lever





