Hi Paolo,

Thanks a lot for your time and help.

I reset all the parameters for bfq and am now running with this one
change from the defaults:
 - low_latency is set to 0.

When tested with our applications, the IOPRIO_CLASS_RT task (doing
direct I/O) gets no preference over the IOPRIO_CLASS_BE task (which
was using the kernel page cache).

When tested with dd, the IOPRIO_CLASS_RT task (with direct I/O) was
getting about 1/4th of the bandwidth that the IOPRIO_CLASS_BE task
(with buffered I/O) was getting.
 - RT task with direct I/O:
   ionice -c1 -n2 /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT bs=2M oflag=direct
 - BE task:
   /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT bs=2M

Please let me know if you want me to test anything more, Paolo. I look
forward to hearing from you.

Thanks,
Madhav.
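In case it helps to reproduce, a minimal sketch of the tuning steps
above (the device name /dev/nvme0n1 is only an example, and these
sysfs writes need root):

  # re-select the scheduler so bfq starts again from its default tunables
  echo none > /sys/block/nvme0n1/queue/scheduler
  echo bfq  > /sys/block/nvme0n1/queue/scheduler

  # disable low_latency (it defaults to 1); leave everything else alone
  echo 0 > /sys/block/nvme0n1/queue/iosched/low_latency

  # confirm the current values
  grep . /sys/block/nvme0n1/queue/iosched/*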
On Fri, Dec 14, 2018 at 12:20 PM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>
> > Il giorno 14 dic 2018, alle ore 14:55, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> ha scritto:
> >
> > Hi Paolo,
> >
> > Thanks a lot for your work and your response to this email.
> >
>
> Thank you for trying bfq!
>
> > Following your advice, I switched my real-time application to direct I/O.
> > We now have:
> >
> > Task1: Using ionice -c1, we run an RT I/O, O_DIRECT task that writes
> > to flood the NVMe drive to its capacity.
> > Task2: We run a normal (best-effort) I/O, async (page-cache buffered)
> > task that writes to flood the NVMe drive to its capacity.
> >
> > What we now see is that Task2 still ends up getting about 3/5th or
> > more of the NVMe bandwidth and Task1 ends up getting the rest of the
> > NVMe disk bandwidth. Could the kernel threads/buffering be
> > overpowering the RT priority of Task1?
> >
> > What we desire is to loosely ensure that Task1 gets as much
> > bandwidth as it asks for in any iteration while Task2 and the
> > remaining tasks share the leftover bandwidth.
> >
> > We are currently testing with these settings in BFQ.
> >
> > low_latency = 0 (to control the bandwidth allocation)
>
> This may be a good idea. But, after we sort this out, you could try
> leaving low_latency enabled. It should cause no harm (I'm trying to
> make bfq smarter and smarter).
>
> > slice_idle = 0 (we are using a fast NVMe drive, and if Task1 does not
> > have any requests and control goes to Task2, it seems to make sense to
> > get control back to Task1 quickly)
>
> Unfortunately, this is not correct. Setting slice_idle to 0 means
> completely losing control of I/O. And in your case you need ...
> control :)
>
> So, you'd better leave slice_idle as it is.
>
> The actual problem is that, while thinking about whether it was
> reasonable to set slice_idle == 0, I realized that bfq is likely to fail
> even with slice_idle > 0. In fact, I have not checked that RT priority
> is respected for probably a few years. And I have changed lots of
> critical operations in these years.
>
> So, I'll wait for your feedback, which I do expect to be negative.
> Then, if it is actually negative, I'll work on the bug behind the
> failure.
>
> > timeout_sync = 1 (Task1 does application-level buffering and always
> > sends the biggest chunk of data available, in the high MBs)
>
> Also for this one, things should go well even without touching it.
> But we will find out after discovering whether bfq is broken for your
> use case.
>
> Looking forward to your feedback,
> Paolo
>
> > We are unable to make the leap to cgroups at this time, Paolo. Is
> > there anything we can tune in BFQ, or change in the way we generate
> > traffic in Task1, to ensure that Task1 gets the bandwidth it asks for?
> >
> > A rough approximation to the Task1 and Task2 traffic we discussed
> > above seems to be these invocations of dd:
> >
> > Task1: ionice -c1 -n2 /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT1 bs=2M oflag=direct
> > Task2: /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT2 bs=2M
> >
> > Thanks again Paolo,
> > Madhav.
> >
> > On Fri, Dec 14, 2018 at 1:38 AM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
> >>
> >>> Il giorno 13 dic 2018, alle ore 21:34, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> ha scritto:
> >>>
> >>> In our setup, we have a task that writes to an NVMe SSD drive through
> >>> the page cache (using ::write OS calls). This task does application-level
> >>> buffering and sends big (multi-MB) chunks of data to each ::write call.
> >>> Each instance of the task writes up to 10 Gbps of data to the NVMe SSD.
> >>>
> >>> We run two instances of this task as below.
> >>> Instance 1: Using ionice -c1, we run an RT I/O instance of this task.
> >>> Instance 2: We run a normal (best-effort) I/O instance of this task.
> >>>
> >>> Both write task instances compete for NVMe bandwidth. We observe
> >>> that BFQ allocates equal bandwidth to both task instances starting
> >>> a few seconds after they start up.
> >>>
> >>> What we expected is that Instance 1 (IOPRIO_CLASS_RT scheduling class)
> >>> would be granted all the bandwidth it asked for, while Instance 2 would
> >>> be allowed to consume the remaining bandwidth.
> >>>
> >>> Could you please help us understand how we might design for our
> >>> expected behavior?
> >>>
> >>
> >> Hi,
> >> if you do async, in-memory writes, then your task instances just dirty
> >> vm pages. Then different processes, the kworkers, will do the
> >> writeback of dirty pages asynchronously, according to the system
> >> writeback logic and configuration. kworkers have their own priority,
> >> which is likely to be the same for each such process. AFAICT this
> >> priority is not related to the priority you give to your processes.
> >>
> >> If you want to control I/O bandwidths for writes, go for direct I/O or
> >> use cgroups. In case of cgroups, consider that there is still the
> >> oddity that bfq interface parameters are non-standard. We have
> >> proposed and are pushing for a solution to this problem [1].
> >>
> >> Thanks,
> >> Paolo
> >>
> >> [1] https://lkml.org/lkml/2018/11/19/366
> >>
> >>> Thanks
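Re the cgroup route mentioned above: if we do manage to move to cgroups
later, my understanding is that per-group bfq weights would look roughly
like the sketch below on a cgroup-v1 (blkio) system with bfq group
scheduling enabled. The group names, weights, and pid placeholders are
only examples, not something we have tested, and the bfq parameter
names may still change per [1]:

  # example only: two blkio groups with different bfq weights
  mkdir /sys/fs/cgroup/blkio/rt-writer
  mkdir /sys/fs/cgroup/blkio/be-writer

  # give the "RT" group most of the proportional share (range 1-1000)
  echo 800 > /sys/fs/cgroup/blkio/rt-writer/blkio.bfq.weight
  echo 100 > /sys/fs/cgroup/blkio/be-writer/blkio.bfq.weight

  # move the two writer processes into their groups
  echo <pid-of-Task1> > /sys/fs/cgroup/blkio/rt-writer/cgroup.procs
  echo <pid-of-Task2> > /sys/fs/cgroup/blkio/be-writer/cgroup.procs

As I understand it, weights give proportional sharing rather than the
strict precedence that IOPRIO_CLASS_RT implies, so a high weight for
the RT group would only approximate the behavior we are after.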