On 3/10/22 6:14 PM, Ming Lei wrote:
> On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote:
>> On 3/10/22 5:31 PM, Song Liu wrote:
>>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>
>>>> On 3/10/22 4:33 PM, Song Liu wrote:
>>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On 3/10/22 3:37 PM, Song Liu wrote:
>>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote:
>>>>>>>>> RAID array check/repair operations benefit a lot from merging
>>>>>>>>> requests. If we only check the previous entry for a merge attempt,
>>>>>>>>> many merges will be missed. As a result, a significant regression
>>>>>>>>> is observed for RAID check and repair.
>>>>>>>>>
>>>>>>>>> Fix this by checking more than just the previous entry when
>>>>>>>>> plug->multiple_queues == true.
>>>>>>>>>
>>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from
>>>>>>>>> 19 MB/s to 103 MB/s.
>>>>>>>>
>>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious
>>>>>>>> why the merges aren't being done there, it would be trivial when
>>>>>>>> the list is flushed out. Because if the perf difference is that
>>>>>>>> big, then other workloads would be suffering if they are that
>>>>>>>> sensitive to being within a plug's worth of IO.
>>>>>>>
>>>>>>> The disks have mq-deadline by default. I also tried kyber, the
>>>>>>> result is the same. RAID repair work sends IOs to all the HDDs in a
>>>>>>> round-robin manner. If we only check the previous request, there
>>>>>>> isn't much opportunity to merge. I guess other workloads may have
>>>>>>> different behavior?
>>>>>>
>>>>>> Round robin, one at a time? I feel like there's something odd or
>>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug
>>>>>> merging.
>>>>>
>>>>> It is not one request at a time, but more like (for raid456):
>>>>> read 4kB from HDD1, HDD2, HDD3, ...,
>>>>> then read another 4kB from HDD1, HDD2, HDD3, ...
>>>>
>>>> Ehm, that very much looks like one-at-a-time from each drive, which
>>>> is pretty much the worst way to do it :-)
>>>>
>>>> Is there a reason for that? Why isn't it using 64k chunks or
>>>> something like that? You could still do that as a kind of read-ahead,
>>>> even if you're still processing in chunks of 4k.
>>>
>>> raid456 handles its logic at the granularity of a stripe. Each stripe
>>> is 4kB from every HDD in the array. AFAICT, we need some non-trivial
>>> changes to enable read-ahead.
>>
>> Right, you'd need to stick some sort of caching in between so that
>> instead of reading 4k directly, you ask the cache for 4k and that can
>> manage read-ahead.
>>
>>>>>> Plug merging is mainly meant to reduce the overhead of merging and
>>>>>> complement what the scheduler would do. If there's a big drop in
>>>>>> performance just by not getting as efficient merging on the plug
>>>>>> side, that points to an issue with something else.
>>>>>
>>>>> We introduced blk_plug_max_rq_count() to give md more opportunities
>>>>> to merge on the plug side, so I guess the behavior has been like
>>>>> this for a long time. I will take a look at the scheduler side and
>>>>> see whether we can just merge later, but I am not very optimistic
>>>>> about it.
>>>>
>>>> Yeah I remember, and that also kind of felt like a work-around for
>>>> some underlying issue. Maybe there's something about how the IO is
>>>> issued that makes it go straight to disk and we never get any
>>>> merging? Is it because they are sync reads?
>>>>
>>>> In any case, just doing larger reads would likely help quite a bit,
>>>> but it would still be nice to get to the bottom of why we're not
>>>> seeing the level of merging we expect.
>>>
>>> Let me look more into this. Maybe we messed something up in the
>>> scheduler.
>>
>> I'm assuming you have a plug set up for doing the reads, which is why
>> you see the big difference (or there would be none). But
>> blk_mq_flush_plug_list() should really take care of this when the plug
>> is flushed; requests should be merged at that point. And from your
>> description, it doesn't sound like they are at all.
>
> Requests are shared; when running out of requests, the plug list will
> be flushed early.

That is true, but I don't think that's the problem here with the round
robin approach. Seems like it'd drive a pretty low queue depth, even
considering SATA.

> Maybe we can just put bios into the plug list directly, then handle
> them all in blk_mq_flush_plug_list.

That might not be a bad idea regardless. And it should be trivial now,
with the plug list being singly linked anyway.

-- 
Jens Axboe
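[Editor's note: the sketch below is a minimal, standalone user-space
simulation of the merge behavior discussed in the thread. It is not
kernel code; blk_attempt_plug_merge() and the real plug list are not
used here, and try_merge()/simulate() are made-up helpers. It only
illustrates why, under the round-robin 4kB-per-HDD pattern Song
describes, checking just the previous plug entry finds essentially no
merges (the previous entry always targets a different disk), while
looking back about one entry per disk merges nearly everything.]

/*
 * Illustrative sketch (not kernel code): round-robin 4kB reads across
 * many disks vs. plug-merge lookback depth.
 *
 * Each "request" is a (disk, sector) pair. The check/repair pattern is:
 * read 4kB from disk 0, 1, ..., N-1, then the next 4kB from disk 0, 1,
 * ... again. Two requests can be merged when they target the same disk
 * and are sector-contiguous.
 */
#include <stdio.h>

#define NDISKS   20   /* 20-HDD array from the thread           */
#define NRQS     512  /* requests issued while the plug is held */
#define SECTORS  8    /* 4kB = 8 x 512B sectors                  */

struct rq { int disk; long sector; };

/* Try to back-merge the new request into one of the last 'depth'
 * requests in the list; return 1 on success. */
static int try_merge(struct rq *list, int len, struct rq *new, int depth)
{
	for (int i = len - 1; i >= 0 && i >= len - depth; i--) {
		if (list[i].disk == new->disk &&
		    list[i].sector + SECTORS == new->sector) {
			list[i].sector = new->sector; /* grow existing request */
			return 1;
		}
	}
	return 0;
}

static int simulate(int depth)
{
	struct rq list[NRQS];
	int len = 0, merged = 0;

	for (int n = 0; n < NRQS; n++) {
		/* round-robin 4kB reads: disk cycles fastest,
		 * per-disk offset grows slowly */
		struct rq new = { n % NDISKS, (long)(n / NDISKS) * SECTORS };

		if (try_merge(list, len, &new, depth))
			merged++;
		else
			list[len++] = new;
	}
	return merged;
}

int main(void)
{
	/* depth 1 ~ "check only the previous entry";
	 * depth NDISKS ~ "look back one entry per queue" */
	printf("check previous entry only: %d merges\n", simulate(1));
	printf("check last %d entries:     %d merges\n",
	       NDISKS, simulate(NDISKS));
	return 0;
}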