Re: [PATCH V4] block: optimize for small block size IO

Jens Axboe <axboe@xxxxxxxxx> · Mon, 4 Nov 2019 20:33:32 -0700

On 11/4/19 8:14 PM, Kent Overstreet wrote:
> On Mon, Nov 04, 2019 at 07:38:42PM -0700, Jens Axboe wrote:
>> This is where my knee jerk at the initial "partial completions" and
>> "should be trivial" start to kick in. I don't think they are
>> necessarily hard, but they aren't free either. And you'd need to be
>> paying that atomic_dec cost for every IO.
> 
> No need - you added the code to avoid that atomic dec for bi_remaining
> in the common case, the same approach will work here.

I guess that would work for the common case of not splitting. If we split,
then it's OK to pay a higher cost. We would have anyway, with the
existing code.
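
To make the idea concrete, here is a rough user-space sketch of that
pattern, loosely modeled on how the BIO_CHAIN flag gates the
bi_remaining decrement. The struct and function names (struct io,
io_split, io_remaining_done) are hypothetical stand-ins, not kernel
code: the unsplit path only tests a flag, and the atomic is touched
only after a split.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a bio; not the kernel struct. */
struct io {
	bool chained;          /* set only once the IO has been split */
	atomic_int remaining;  /* outstanding pieces, used when chained */
};

static void io_init(struct io *io)
{
	io->chained = false;
	atomic_init(&io->remaining, 1);
}

/* Splitting: from here on, completions have to be counted. */
static void io_split(struct io *io)
{
	io->chained = true;
	atomic_fetch_add(&io->remaining, 1);
}

/* Returns true once the whole IO is complete. */
static bool io_remaining_done(struct io *io)
{
	/* Common case: never split, so no atomic operation at all. */
	if (!io->chained)
		return true;

	/* Split case: pay for the atomic; the last piece sees 1 -> 0. */
	return atomic_fetch_sub(&io->remaining, 1) == 1;
}
```

An unsplit IO completes on the first io_remaining_done() call without
an atomic op; an IO split once needs two calls, each paying for the
atomic.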

>> currently have to do, maybe not... If it's a clear win, then it'd be
>> an interesting path to pursue. But we probably won't have that answer
>> until at least a hacky version is done as proof of concept.
>>
>> On the upside, it'd simplify things to just have the mapping in one
>> place, when the request is setup. Though until all drivers do that
>> (which I worry will be never), then we'd be stuck with both. Maybe
>> that's a bit too pessimistic, should be easier now since we just have
>> blk-mq.
> 
> blk_rq_map_sg isn't called from _that_ many places, I suspect once
> it's figured out for one driver the rest won't be that bad.

It's definitely easier than it would have been, most things are pretty
streamlined now with the blk-mq conversion. And the ones that don't call
blk_rq_map_sg() usually don't do DMA on the requests. Sizes tend to be
more arbitrary there, and not hard boundaries.

> And even if some drivers remain unconverted, I personally _much_
> prefer this approach to more special case fast paths, and I bet this
> approach will be faster anyways.
> 
> Also - regarding driver allocating of the sglists, I think most high
> performance drivers preallocate a pool of sglists that are all sized
> to what the device is capable of.

They do, but most of them probably also assume one sg list per request.
We'd have one request now instead of multiple, so we'd either be
serializing it (which would definitely suck for some common cases) or
doing something very funky with mapping to multiple requests at the same
time.

But I don't think we should argue this much more. If someone wants to do
the work to make this work, even a prototype, it's much better to argue
over actual code than potential issues and wins now.

-- 
Jens Axboe