On Tue, 2014-07-08 at 15:49 +0200, Bart Van Assche wrote:
> Thanks for digging up this information and also for sharing it.

Sure thing; it's a bummer that something in the email must have tickled
vger's taboo filters...

> This is interesting. What I noticed is that in the SRP target
> driver attached to the previous e-mail ("srptest.c") one command at a
> time is processed. However, in the SRP target driver I ran my own
> tests with (based on SCST) multiple SCSI commands are processed
> simultaneously by a single thread. A finite state machine is
> associated with each SCSI command and events like IB work completions
> trigger transitions of that state machine. So that might be a possible
> explanation why my measurement results were different.

True, I expect to see a difference between those models, and I suspect
we were also measuring different values. I was looking at IOPS, and the
fact that I saw a drop when adding the batching (as well as blk-iopoll)
tells me that the test driver was not the limiting factor -- it did not
change between the runs.

I'll readily agree that it is not a "real" load -- it purposefully does
no RDMA reads or writes; it just responds to each request as fast as
possible. The fact that it is single-threaded shouldn't matter in this
context -- the IB streaming benchmarks are also single-threaded and can
do many more operations per second.

I'll also admit that perhaps your threaded target bunches up responses
a bit, so the initiator may see enough work to make the batching worth
it in your tests. That would be a bit surprising, but not overly so.

Regardless, it would be nice to see your numbers for this change, and
if you have time to run it against my test target and post those for
comparison, that'd be swell. I no longer have much access to relevant
IB hardware -- just some SDR and perhaps a few DDR mthca cards at home,
and none are currently in machines.
> However, before I repost (a variant of) this patch I will try to find
> a way to combine the advantages of interrupt-based processing (low
> latency) and the blk-iopoll approach (minimal time spent in interrupt
> context).

Have you run an idle-soaking process like zc (by Andrew Morton, back in
the days of sendfile) to measure the CPU usage of the different
approaches? Those numbers would be interesting alongside the latency and
IOPS figures -- together they would give everyone a picture of the
trade-offs involved.

Thanks, and while I may well have more questions later, I'm glad you're
looking at this.

Dave