Re: [PATCH V4] block: optimize for small block size IO

Jens Axboe <axboe@xxxxxxxxx> · Mon, 4 Nov 2019 19:38:42 -0700

On 11/4/19 7:30 PM, Kent Overstreet wrote:
> On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote:
>> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote:
>>> On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote:
>>>> On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote:
>>>>> On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote:
>>>>>> On 11/4/19 11:17 AM, Kent Overstreet wrote:
>>>>>>> On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrote:
>>>>>>>> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote:
>>>>>>>>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote:
>>>>>>>>>> __blk_queue_split() may be a bit heavy for small block size (such as
>>>>>>>>>> 512B or 4KB) IO, so introduce a flag to indicate whether the bio includes
>>>>>>>>>> multiple pages, and only try to split the bio when that flag is set.
>>>>>>>>>
>>>>>>>>> So, back in the day I had an alternative approach in mind: get rid of
>>>>>>>>> blk_queue_split entirely, by pushing splitting down to the request layer - when
>>>>>>>>> we map the bio/request to sgl, just have it map as much as will fit in the sgl
>>>>>>>>> and if it doesn't entirely fit bump bi_remaining and leave it on the request
>>>>>>>>> queue.
>>>>>>>>>
>>>>>>>>> This would mean there'd be no need for counting segments at all, and would cut a
>>>>>>>>> fair amount of code out of the io path.
>>>>>>>>
>>>>>>>> I thought about that too, but it will take a lot more effort.  Mostly
>>>>>>>> because md/dm heavily rely on splitting as well.  I still think it is
>>>>>>>> worthwhile, it will just take a significant amount of time and we
>>>>>>>> should have the quick improvement now.
>>>>>>>
>>>>>>> We can do it one driver at a time - driver sets a flag to disable
>>>>>>> blk_queue_split(). Obvious one to do first would be nvme since that's where it
>>>>>>> shows up the most.
>>>>>>>
>>>>>>> And md/dm do splitting internally, but I'm not so sure they need
>>>>>>> blk_queue_split().
>>>>>>
>>>>>> I'm a big proponent of doing something like that instead, but it is a
>>>>>> lot of work. I absolutely hate the splitting we're doing now, even
>>>>>> though the original "let's work as hard as we can at add page time to get
>>>>>> things right" was pretty abysmal as well.
>>>>>
>>>>> Last I looked I don't think it was going to be that bad, just needed a bit of
>>>>> finesse. We just need to be able to partially process a request in e.g.
>>>>> nvme_map_data(), and blk_rq_map_sg() needs to be modified to only map as much as
>>>>> will fit instead of popping an assertion.
>>>>
>>>> I think it may not be doable.
>>>>
>>>> blk_rq_map_sg() is called by drivers and has to work on a single request, but
>>>> more requests have to be involved if we delay the splitting to blk_rq_map_sg(),
>>>> because splitting means that the two bios can't be submitted in a single IO request.
>>>
>>> Of course it's doable, do I have to show you how?
>>
>> No, you don't have to, could you just point out where my words above are wrong?
> 
> blk_rq_map_sg() _currently_ works on a single request, but as I said
> from the start, this would involve changing it to only process as
> much of a request as would fit on an sglist.
> 
> Drivers will have to be modified, but the changes to driver code
> should be pretty easy. What will be slightly trickier will be changing
> blk-mq to handle requests that are only partially completed; that will
> be harder than it would have been before blk-mq, since the old request
> queue code used to handle partially completed requests - not much work
> would have to be done to that code.
> 
> I'm not very familiar with the blk-mq code, so Jens would be better
> qualified to say how best to change that code. The basic idea would
> probably be the same as how bios now have a refcount - bi_remaining -
> to track splits/completions. If requests (in blk-mq land) don't have
> such a refcount (they don't appear to), it will have to be added.
> 
> From a quick glance, blk_mq_complete_request() is where the refcount
> put will have to be added. I haven't found where requests are popped
> off the request queue in blk-mq land yet - the code will have to be
> changed to only do that once the request has been fully mapped and
> submitted by the driver.

This is where my knee-jerk reaction to the initial "partial completions"
and "should be trivial" starts to kick in. I don't think they are necessarily
hard, but they aren't free either. And you'd need to be paying that
atomic_dec cost for every IO. Maybe that's cheaper than the work we
currently have to do, maybe not... If it's a clear win, then it'd be an
interesting path to pursue. But we probably won't have that answer until
at least a hacky version is done as proof of concept.
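
To make the cost being discussed concrete, here is a rough userspace
sketch of a bi_remaining-style refcount on a request that is mapped in
chunks. It is an illustration only, not kernel code: struct
sketch_request and every sketch_*() helper are hypothetical names made
up for this example, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; NOT the kernel's struct request. */
struct sketch_request {
	size_t total_segs;    /* segments the request covers */
	size_t mapped_segs;   /* segments mapped and submitted so far */
	atomic_int remaining; /* in-flight chunks, plus one submitter ref */
	bool done;            /* set by whichever put drops the last ref */
};

static void sketch_init(struct sketch_request *rq, size_t total_segs)
{
	rq->total_segs = total_segs;
	rq->mapped_segs = 0;
	rq->done = false;
	/* The submitter holds one reference until mapping has finished. */
	atomic_init(&rq->remaining, 1);
}

/*
 * Map at most sg_max segments (the assumed sglist capacity) and take a
 * reference for the chunk. Returns the number of segments mapped, or 0
 * when nothing is left to map.
 */
static size_t sketch_map_partial(struct sketch_request *rq, size_t sg_max)
{
	size_t left = rq->total_segs - rq->mapped_segs;
	size_t n = left < sg_max ? left : sg_max;

	if (n == 0)
		return 0;
	atomic_fetch_add(&rq->remaining, 1);
	rq->mapped_segs += n;
	return n;
}

static bool sketch_fully_mapped(const struct sketch_request *rq)
{
	return rq->mapped_segs == rq->total_segs;
}

/*
 * Drop one reference. Returns true only for the final put, which is the
 * point where the request as a whole would be completed. This decrement
 * is paid once per chunk plus once by the submitter, even when the
 * request was never split.
 */
static bool sketch_put(struct sketch_request *rq)
{
	if (atomic_fetch_sub(&rq->remaining, 1) == 1) {
		rq->done = true;
		return true;
	}
	return false;
}
```

In this sketch a request that fits in one sglist still pays one
increment and two decrements, which is the overhead that would have to
be weighed against the splitting work done today.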

On the upside, it'd simplify things to just have the mapping in one
place, when the request is set up. Though until all drivers do that
(which I worry will be never), we'd be stuck with both. Maybe
that's a bit too pessimistic, should be easier now since we just have
blk-mq.

-- 
Jens Axboe