Hi Paolo,

Thanks a lot for your time and help.

I reset all the parameters for bfq and am now running with this one
change from the defaults:
 - low_latency is set to 0.

When tested with our applications, the IOPRIO_CLASS_RT task (doing
direct I/O) gets no preference over the IOPRIO_CLASS_BE task (which
was using the kernel page cache).

When tested with dd, the IOPRIO_CLASS_RT task (with direct I/O) was
getting about 1/4th of the bandwidth that the IOPRIO_CLASS_BE task
(with buffered I/O) was getting.
 - RT task with direct I/O:
   ionice -c1 -n2 /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT bs=2M oflag=direct
 - BE task:
   /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT bs=2M

Please let me know if you want me to test anything more, Paolo. I look
forward to hearing from you.

Thanks,
Madhav.
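In case it helps to reproduce, a minimal sketch of the tuning steps
above (the device name /dev/nvme0n1 is only an example, and these
sysfs writes need root):

  # re-select the scheduler so bfq starts again from its default tunables
  echo none > /sys/block/nvme0n1/queue/scheduler
  echo bfq  > /sys/block/nvme0n1/queue/scheduler

  # disable low_latency (it defaults to 1); leave everything else alone
  echo 0 > /sys/block/nvme0n1/queue/iosched/low_latency

  # confirm the current values
  grep . /sys/block/nvme0n1/queue/iosched/*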
On Fri, Dec 14, 2018 at 12:20 PM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>
> > Il giorno 14 dic 2018, alle ore 14:55, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> ha scritto:
> >
> > Hi Paolo,
> >
> > Thanks a lot for your work and your response to this email.
> >
>
> Thank you for trying bfq!
>
> > Following your advice, I switched my real-time application to direct I/O.
> > We now have:
> >
> > Task1: Using ionice -c1, we run an RT I/O, O_DIRECT task that writes
> > to flood the NVMe drive to its capacity.
> > Task2: We run a normal (best-effort) I/O, async (page-cache buffered)
> > task that writes to flood the NVMe drive to its capacity.
> >
> > What we now see is that Task2 still ends up getting about 3/5th or
> > more of the NVMe bandwidth and Task1 ends up getting the rest of the
> > NVMe disk bandwidth. Could the kernel threads/buffering be
> > overpowering the RT priority of Task1?
> >
> > What we desire is to loosely ensure that Task1 gets as much
> > bandwidth as it asks for in any iteration while Task2 and the
> > remaining tasks share the leftover bandwidth.
> >
> > We are currently testing with these settings in BFQ.
> >
> > low_latency = 0 (to control the bandwidth allocation)
>
> This may be a good idea. But, after we sort this out, you could try
> leaving low_latency enabled. It should cause no harm (I'm trying to
> make bfq smarter and smarter).
>
> > slice_idle = 0 (we are using a fast NVMe drive, and if Task1 does not
> > have any requests and control goes to Task2, it seems to make sense to
> > get control back to Task1 quickly)
>
> Unfortunately, this is not correct. Setting slice_idle to 0 means
> completely losing control of I/O. And in your case you need ...
> control :)
>
> So, you'd better leave slice_idle as it is.
>
> The actual problem is that, while thinking about whether it was
> reasonable to set slice_idle == 0, I realized that bfq is likely to fail
> even with slice_idle > 0. In fact, I have not checked that RT priority
> is respected for probably a few years. And I have changed lots of
> critical operations in these years.
>
> So, I'll wait for your feedback, which I do expect to be negative.
> Then, if it is actually negative, I'll work on the bug behind the
> failure.
>
> > timeout_sync = 1 (Task1 does application-level buffering and always
> > sends the biggest chunk of data available, in the high MBs)
>
> Also for this one, things should go well even without touching it.
> But we will find out after discovering whether bfq is broken for your
> use case.
>
> Looking forward to your feedback,
> Paolo
>
> > We are unable to make the leap to cgroups at this time, Paolo. Is
> > there anything we can tune in BFQ, or change in the way we generate
> > traffic in Task1, to ensure that Task1 gets the bandwidth it asks for?
> >
> > A rough approximation to the Task1 and Task2 traffic we discussed
> > above seems to be these invocations of dd:
> >
> > Task1: ionice -c1 -n2 /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT1 bs=2M oflag=direct
> > Task2: /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT2 bs=2M
> >
> > Thanks again Paolo,
> > Madhav.
> >
> > On Fri, Dec 14, 2018 at 1:38 AM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
> >>
> >>> Il giorno 13 dic 2018, alle ore 21:34, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> ha scritto:
> >>>
> >>> In our setup, we have a task that writes to an NVMe SSD drive through
> >>> the page cache (using ::write OS calls). This task does application-level
> >>> buffering and sends big (multi-MB) chunks of data to each ::write call.
> >>> Each instance of the task writes up to 10 Gbps of data to the NVMe SSD.
> >>>
> >>> We run two instances of this task as below.
> >>> Instance 1: Using ionice -c1, we run an RT I/O instance of this task.
> >>> Instance 2: We run a normal (best-effort) I/O instance of this task.
> >>>
> >>> Both write task instances compete for NVMe bandwidth. We observe
> >>> that BFQ allocates equal bandwidth to both task instances starting
> >>> a few seconds after they start up.
> >>>
> >>> What we expected is that Instance 1 (IOPRIO_CLASS_RT scheduling class)
> >>> would be granted all the bandwidth it asked for, while Instance 2 would
> >>> be allowed to consume the remaining bandwidth.
> >>>
> >>> Could you please help us understand how we might design for our
> >>> expected behavior?
> >>>
> >>
> >> Hi,
> >> if you do async, in-memory writes, then your task instances just dirty
> >> vm pages. Then different processes, the kworkers, will do the
> >> writeback of dirty pages asynchronously, according to the system
> >> writeback logic and configuration. kworkers have their own priority,
> >> which is likely to be the same for each such process. AFAICT this
> >> priority is not related to the priority you give to your processes.
> >>
> >> If you want to control I/O bandwidths for writes, go for direct I/O or
> >> use cgroups. In case of cgroups, consider that there is still the
> >> oddity that bfq interface parameters are non-standard. We have
> >> proposed and are pushing for a solution to this problem [1].
> >>
> >> Thanks,
> >> Paolo
> >>
> >> [1] https://lkml.org/lkml/2018/11/19/366
> >>
> >>> Thanks
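Re the cgroup route mentioned above: if we do manage to move to cgroups
later, my understanding is that per-group bfq weights would look roughly
like the sketch below on a cgroup-v1 (blkio) system with bfq group
scheduling enabled. The group names, weights, and pid placeholders are
only examples, not something we have tested, and the bfq parameter
names may still change per [1]:

  # example only: two blkio groups with different bfq weights
  mkdir /sys/fs/cgroup/blkio/rt-writer
  mkdir /sys/fs/cgroup/blkio/be-writer

  # give the "RT" group most of the proportional share (range 1-1000)
  echo 800 > /sys/fs/cgroup/blkio/rt-writer/blkio.bfq.weight
  echo 100 > /sys/fs/cgroup/blkio/be-writer/blkio.bfq.weight

  # move the two writer processes into their groups
  echo <pid-of-Task1> > /sys/fs/cgroup/blkio/rt-writer/cgroup.procs
  echo <pid-of-Task2> > /sys/fs/cgroup/blkio/be-writer/cgroup.procs

As I understand it, weights give proportional sharing rather than the
strict precedence that IOPRIO_CLASS_RT implies, so a high weight for
the RT group would only approximate the behavior we are after.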