Re: Unintuitive scheduling results using BFQ

> On 14 Dec 2018, at 14:55, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> wrote:
> 
> Hi Paolo,
> 
>    Thanks a lot for your work and your response to this email.
> 

Thank you for trying bfq!

>    Following your advice, I switched my real-time application to direct I/O.
>    We now have
> 
>    Task1: Using ionice -c1, we run an RT-class O_DIRECT task that
> writes to flood the NVMe drive to its capacity.
>    Task2: We run a normal (best-effort), async (page-cache-buffered)
> task that writes to flood the NVMe drive to its capacity.
> 
>    What we now see is that Task2 still ends up getting about three
> fifths or more of the NVMe bandwidth, and Task1 ends up getting the
> rest. Could the kernel threads/buffering be overpowering the RT
> priority of Task1?
> 
>    What we desire is to loosely ensure that Task1 gets as much
> bandwidth as it asks for in any iteration while Task2 and the
> remaining tasks share the leftover bandwidth.
> 
>    We are currently testing with these settings in BFQ.
>    low_latency = 0 (to control the bandwidth allocation)

This may be a good idea.  But, after we sort this out, you could try
leaving low_latency enabled.  It should cause no harm (I'm trying to
make bfq smarter and smarter).
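
In case it is useful later, here is a minimal sketch of how to check
the current tunables and re-enable low_latency from the shell.  It
assumes the drive shows up as nvme0n1 and that bfq is the active
scheduler for it, so adjust the device name to your setup.

  # confirm bfq is in use and record the current values of its tunables
  cat /sys/block/nvme0n1/queue/scheduler
  grep . /sys/block/nvme0n1/queue/iosched/*

  # once the RT issue is sorted out, turn low_latency back on
  echo 1 > /sys/block/nvme0n1/queue/iosched/low_latency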


>    slice_idle = 0     (we are using a fast NVMe, and if Task1 has no
> pending requests and control goes to Task2, it seems to make sense to
> get control back to Task1 quickly)

Unfortunately, this is not correct.  Setting slice_idle to 0 means
losing control of I/O completely.  And in your case you need ...
control :)

So, you'd better leave slice_idle as it is.
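
For example (again assuming the device is nvme0n1; 8 ms should be the
bfq default for slice_idle, but double-check the value you read before
overwriting anything):

  # check what slice_idle is currently set to (in ms)
  cat /sys/block/nvme0n1/queue/iosched/slice_idle

  # restore a non-zero value instead of 0
  echo 8 > /sys/block/nvme0n1/queue/iosched/slice_idle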

The actual problem is that, while thinking about whether it was
reasonable to set slice_idle == 0, I realized that bfq is likely to fail
even with slice_idle > 0.  In fact, I have not checked that RT priority
is respected for probably a few years, and I have changed lots of
critical operations in that time.

So, I'll wait for your feedback, which I do expect to be negative.
Then, if it is actually negative, I'll work on the bug behind the
failure.

>    timeout_sync = 1 (Task1 does application-level buffering and
> always sends the biggest chunk of data available, in the high MBs)
> 

For this parameter too, things should go well even without touching
it.  But we will find out once we discover whether bfq is broken for
your use case.
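
To make the feedback easy to compare, the test could be run roughly as
below, starting from your dd approximation.  This is only a sketch: it
assumes the tunables are back at their defaults, that the target files
live on the bfq-managed drive, and that the device is nvme0n1.

  # Task1: RT-class, direct I/O
  ionice -c1 -n2 /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT1 bs=2M oflag=direct &
  # Task2: best-effort, buffered
  /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT2 bs=2M &

  # GNU dd reports its own transfer rate when it receives SIGUSR1
  kill -USR1 %1 %2

  # in another terminal: watch device-level throughput (iostat is part of sysstat)
  iostat -xm 1 nvme0n1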

Looking forward to your feedback,
Paolo

>    We are unable to make the leap to cgroups at this time, Paolo. Is
> there anything we can tune in BFQ, or change in the way we generate
> traffic in Task1, to ensure that Task1 gets the bandwidth it asks for?
> 
>    A rough approximation to the Task1 and Task2 traffic we discussed
> above seems to be these instantiations of dd.
>    Task1: ionice -c1 -n2 /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT1
> bs=2M oflag=direct
>    Task2: /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT2   bs=2M
> 
> Thanks again Paolo,
> Madhav.
> 
> 
> 
> On Fri, Dec 14, 2018 at 1:38 AM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>>> On 13 Dec 2018, at 21:34, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> wrote:
>>> 
>>> In our setup, we have a task that writes to an NVMe SSD drive
>>> through the page cache (using ::write OS calls). This task does
>>> application-level buffering and sends big (multi-MB) chunks of data
>>> to each ::write call. Each instance of the task writes up to 10 Gbps
>>> of data to the NVMe SSD.
>>> 
>>> We run two instances of this task as below.
>>> Instance 1: Using ionice -c1, we run an RT I/O instance of this task.
>>> Instance 2: We run a normal (best-effort) IO instance of this task.
>>> 
>>> Both the write task instances compete for NVMe bandwidth. We observe
>>> that BFQ allocates equal bandwidth to both the task instances starting
>>> a few seconds after they start up.
>>> 
>>> What we expected is that Instance1 (IOPRIO_CLASS_RT scheduling class)
>>> will be granted all the bandwidth it asked for while Instance2 will be
>>> allowed to consume the remaining bandwidth.
>>> 
>>> Could you please help us understand how we might design our setup
>>> to get the expected behavior?
>>> 
>> 
>> Hi,
>> if you do async, in-memory writes, then your task instances just dirty
>> vm pages.  Then different processes, the kworkers, will do the
>> writeback of dirty pages asynchronously, according to the system
>> writeback logic and configuration.  kworkers have their own priority,
>> which is likely to be the same for each such process.  AFAICT this
>> priority is not related to the priority you give to your processes.
>> 
>> If you want to control I/O bandwidth for writes, go for direct I/O or
>> use cgroups.  In the case of cgroups, consider that there is still the
>> oddity that the bfq interface parameters are non-standard.  We have
>> proposed, and are pushing for, a solution to this problem [1].
>> 
>> Thanks,
>> Paolo
>> 
>> [1] https://lkml.org/lkml/2018/11/19/366
>> 
>>> Thanks
>> 
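
(For reference, on the cgroup route mentioned in the quoted reply
above: a minimal cgroup-v2 sketch would look roughly like the lines
below, using the non-standard io.bfq.weight interface that the
proposal in [1] addresses.  The group names, the skewed weights and
the PID variables are only placeholders.)

  # assumes cgroup v2 is mounted at /sys/fs/cgroup and bfq manages the drive
  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/rt_writer /sys/fs/cgroup/be_writer

  # give the critical writer a much larger bfq weight (range 1..1000, default 100)
  echo 1000 > /sys/fs/cgroup/rt_writer/io.bfq.weight
  echo 10 > /sys/fs/cgroup/be_writer/io.bfq.weight

  # move each task into its group before it starts issuing I/O
  echo "$TASK1_PID" > /sys/fs/cgroup/rt_writer/cgroup.procs
  echo "$TASK2_PID" > /sys/fs/cgroup/be_writer/cgroup.procs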




