Re: [PATCH v2] block: BFQ default for single queue devices

Jens Axboe <axboe@xxxxxxxxx> · Tue, 16 Oct 2018 11:35:59 -0600

On 10/15/18 1:44 PM, Paolo Valente wrote:
> 
> 
>> On 15 Oct 2018, at 21:26, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> On 10/15/18 12:26 PM, Paolo Valente wrote:
>>>
>>>
>>>> On 15 Oct 2018, at 17:39, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>
>>>> On 10/15/18 8:10 AM, Linus Walleij wrote:
>>>>> This sets BFQ as the default scheduler for single queue
>>>>> block devices (nr_hw_queues == 1) if it is available. This
>>>>> affects notably MMC/SD-cards but also UBI and the loopback
>>>>> device.
>>>>>
>>>>> I have been running it for a while without any negative
>>>>> effects on my pet systems and I want some wider testing
>>>>> so let's throw it out there and see what people say.
>>>>> Admittedly my use cases are limited. I need to keep this
>>>>> patch around for my personal needs anyway.
>>>>>
>>>>> We take special care to avoid using BFQ on zoned devices
>>>>> (in particular SMR, shingled magnetic recording devices)
>>>>> as these currently require mq-deadline to group writes
>>>>> together.
>>>>>
>>>>> I have opted against introducing any default scheduler
>>>>> through Kconfig as the mq-deadline enforcement for
>>>>> zoned devices has to be done at runtime anyways and
>>>>> too many config options will make things confusing.
>>>>>
>>>>> My argument for setting a default policy in the kernel
>>>>> as opposed to user space is the "reasonable defaults"
>>>>> type, analogous to how we have one default CPU scheduling
>>>>> policy (CFS) that makes most sense for most tasks, and
>>>>> how automatic process group scheduling happens in most
>>>>> distributions without userspace involvement. The BFQ
>>>>> scheduling policy makes most sense for single hardware
>>>>> queue devices and many embedded systems will not have
>>>>> the clever userspace tools (such as udev) to make an
>>>>> educated choice of scheduling policy. Defaults should be
>>>>> those that make most sense for the hardware.
>>>>
>>>> I still don't like this. There are going to be tons of
>>>> cases where the single queue device is some hw raid setup
>>>> or similar, where performance is going to be much worse with
>>>> BFQ than it is with mq-deadline, for instance. That's just
>>>> one case.
>>>>
>>>
>>> Hi Jens,
>>> in my RAID tests bfq performed as well as in non-RAID tests.  Probably
>>> you refer to the fact that, in a RAID configuration, IOPS can become
>>> very high.  But, if that is the case, then the response to your
>>> objections already emerged in the previous thread.  Let me sum it up
>>> again.
>>>
>>> I tested bfq on virtually every device in the range from a few
>>> hundred IOPS to 50-100 KIOPS.  Then, through the public script I already
>>> mentioned, I found the maximum number of IOPS that bfq can handle:
>>> about 400K with a commodity CPU.
>>>
>>> In particular, in all my tests with real hardware, bfq
>>> - is not even comparable to any of the other schedulers, in
>>>  terms of responsiveness, latency for real-time applications, ability
>>>  to provide strong bandwidth guarantees, ability to boost throughput
>>>  while guaranteeing bandwidths;
>>> - is a little worse than the other schedulers for only one test, on
>>>  only some hardware: total throughput with random reads, where it may
>>>  lose up to 10-15% of throughput.  Of course, the schedulers that reach
>>>  a higher throughput leave the machine unusable during the test.
>>>
>>> So I really cannot see a reason why bfq could do worse than any of
>>> these other schedulers for some single-queue device (conservatively)
>>> below 300KIOPS.
>>>
>>> Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
>>> probably less than 1% of all the single-queue storage around (USB
>>> drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
>>> are sacrificing 99% of the hardware, to help 1% of the hardware, for
>>> one kind of test case.
>>
>> I should have been more clear - I'm not worried about IOPS overhead,
>> I'm worried about scheduling decisions that lower performance on
>> (for instance) raid composed of many drives (rotational or otherwise).
>>
>> If you have actual data (on what hardware, and what kind of tests)
>> to disprove that worry, then that's great, and I'd love to see that.
>>
> 
> Here are some old results with a very simple configuration:
> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
> http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
> 
> Then I stopped repeating tests that always yielded the same good results.
> 
> As for more professional systems, a well-known company doing
> real-time packet-traffic dumping asked me to modify bfq so as to
> guarantee lossless data writing also during queries.  The involved box
> had a RAID reaching a few Gbps, and everything worked well.
> 
> Anyway, if you have specific issues in mind, I can check more deeply.

Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in
a plain fast NVMe device, and a big box hardware raid setup with
a ton of drives.

I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.
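
To make the policy under discussion concrete, here is a minimal
user-space sketch in Python of the selection rule described in the patch
text quoted above: BFQ for single queue devices, mq-deadline for zoned
devices. It is not the patch itself. The function names
parse_scheduler_file and choose_default are hypothetical. Falling back to
mq-deadline when BFQ is not available, and leaving multi-queue devices
untouched, are assumptions; the patch description does not spell them out.
The input format assumed here is the one shown by
/sys/block/<dev>/queue/scheduler, where the active scheduler is in
brackets, and the zoned values are those of /sys/block/<dev>/queue/zoned.

```python
def parse_scheduler_file(text):
    """Split a sysfs scheduler line into (available, active).

    Example input: "mq-deadline kyber [bfq] none"
    The entry in square brackets is the currently active scheduler.
    """
    available, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return available, active


def choose_default(nr_hw_queues, zoned, available):
    """Return the scheduler this sketch would pick, or None for no change.

    nr_hw_queues: number of hardware queues of the device
    zoned:        "none", "host-aware" or "host-managed"
    available:    scheduler names offered by the kernel for this device
    """
    if nr_hw_queues != 1:
        # Assumption: the policy only concerns single queue devices.
        return None
    if zoned != "none":
        # Zoned (e.g. SMR) devices need mq-deadline to group writes.
        return "mq-deadline"
    if "bfq" in available:
        return "bfq"
    # Assumption: without BFQ, fall back to mq-deadline.
    return "mq-deadline"
```

A distro could apply the same rule from a udev rule or an init script by
writing the chosen name back to the scheduler file, which keeps the
decision out of the kernel.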

-- 
Jens Axboe

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/