Hi Paolo,

Thanks a lot. The RT behavior you described is exactly how we would
expect our RT priority task to behave.

We are on 4.14. Will test and write back soon.

Thanks,
-Madhav.

On Tue, Dec 18, 2018 at 1:04 PM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>
>
> > Il giorno 14 dic 2018, alle ore 19:25, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> ha scritto:
> >
> > Hi Paolo,
> >
> > Thanks a lot for your time and help.
> >
> > I reset all the parameters for bfq and am now running with this
> > one change from the defaults.
> > - low_latency is set to 0.
> >
> > When tested with our applications, the IOPRIO_CLASS_RT task (doing
> > direct I/O) has no preference over the IOPRIO_CLASS_BE task (that
> > was using the kernel page cache).
> >
> > When tested with dd, the IOPRIO_CLASS_RT task (with direct I/O)
> > was getting about 1/4 of the bandwidth that the IOPRIO_CLASS_BE task
> > (with buffered I/O) was getting.
> > - RT task with direct I/O: ionice -c1 -n2 /bin/dd if=/dev/zero
> > of=/n/ssd1/ddtestRT bs=2M oflag=direct
> > - BE task: /bin/dd if=/dev/zero
> > of=/n/ssd1/ddtestRT bs=2M
> >
> > Please let me know if you want me to test more, Paolo.
> > Will look forward to hearing from you.
> >
>
> Ok, doubt confirmed: bfq didn't properly check whether mixed-class
> processes were doing I/O. I've attached a tentative fix as a
> compressed file (to prevent my client from corrupting it). You can
> also find the same diffs pasted below, if you want to have a look and
> comment.
>
> The patch should apply cleanly on top of current master.
>
> Let me know if it works.
>
> In particular, consider the following two facts about your RT process
> (i.e., your process with RT I/O-priority class):
> 1) When served in parallel with other I/O, your RT process cannot,
> however, reach a higher throughput than it reaches when served alone!
> I'm making this silly remark because you may get confused by the fact
> that modern, parallel drives usually reach a higher and higher
> throughput as the number of processes doing I/O grows.
> 2) If your RT process does I/O greedily, then it will monopolize the
> drive after this patch (if this patch works :) ). This implies that,
> even if you have many processes doing I/O in parallel with your RT
> process, your drive will serve almost only your RT process while the
> latter is active. Therefore parallelism may be reduced, and the total
> throughput of the drive may be lower than that reached without I/O
> control (I'm working on reducing this loss too, but it is really a
> non-trivial task).
>
> Here are the changes.
>
>  block/bfq-iosched.c | 86 ++++++++++++++++++++++++++++-------------------------
>  block/bfq-iosched.h |  8 +++--
>  block/bfq-wf2q.c    | 12 ++++++--
>  3 files changed, 59 insertions(+), 47 deletions(-)
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 97337214bec4..d8d7a32ce64b 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -623,26 +623,6 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
>  		bfqq->pos_root = NULL;
>  }
>
> -/*
> - * Tell whether there are active queues with different weights or
> - * active groups.
> - */
> -static bool bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
> -{
> -	/*
> -	 * For queue weights to differ, queue_weights_tree must contain
> -	 * at least two nodes.
> -	 */
> -	return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
> -		(bfqd->queue_weights_tree.rb_node->rb_left ||
> -		 bfqd->queue_weights_tree.rb_node->rb_right)
> -#ifdef CONFIG_BFQ_GROUP_IOSCHED
> -	       ) ||
> -	       (bfqd->num_groups_with_pending_reqs > 0
> -#endif
> -	       );
> -}
> -
>  /*
>   * The following function returns true if every queue must receive the
>   * same share of the throughput (this condition is used when deciding
> @@ -651,25 +631,48 @@ static bool bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
>   *
>   * Such a scenario occurs when:
>   * 1) all active queues have the same weight,
> - * 2) all active groups at the same level in the groups tree have the same
> - *    weight,
> + * 2) all active queues belong to the same I/O-priority class,
>   * 3) all active groups at the same level in the groups tree have the same
> + *    weight,
> + * 4) all active groups at the same level in the groups tree have the same
>   *    number of children.
>   *
>   * Unfortunately, keeping the necessary state for evaluating exactly
>   * the last two symmetry sub-conditions above would be quite complex
> - * and time consuming. Therefore this function evaluates, instead,
> - * only the following stronger two sub-conditions, for which it is
> + * and time consuming. Therefore this function evaluates, instead,
> + * only the following stronger three sub-conditions, for which it is
>   * much easier to maintain the needed state:
>   * 1) all active queues have the same weight,
> - * 2) there are no active groups.
> + * 2) all active queues belong to the same I/O-priority class,
> + * 3) there are no active groups.
>   * In particular, the last condition is always true if hierarchical
>   * support or the cgroups interface are not enabled, thus no state
>   * needs to be maintained in this case.
>   */
>  static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
>  {
> -	return !bfq_varied_queue_weights_or_active_groups(bfqd);
> +	/*
> +	 * For queue weights to differ, queue_weights_tree must contain
> +	 * at least two nodes.
> +	 */
> +	bool varied_queue_weights = !RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
> +		(bfqd->queue_weights_tree.rb_node->rb_left ||
> +		 bfqd->queue_weights_tree.rb_node->rb_right);
> +
> +	bool multiple_classes_busy =
> +		(bfqd->busy_queues[0] & bfqd->busy_queues[1]) |
> +		(bfqd->busy_queues[0] & bfqd->busy_queues[2]) |
> +		(bfqd->busy_queues[1] & bfqd->busy_queues[2]);
> +
> +	/*
> +	 * For queue weights to differ, queue_weights_tree must contain
> +	 * at least two nodes.
> +	 */
> +	return !(varied_queue_weights || multiple_classes_busy
> +#ifdef BFQ_GROUP_IOSCHED_ENABLED
> +		 || bfqd->num_groups_with_pending_reqs > 0
> +#endif
> +		);
>  }
>
>  /*
> @@ -728,15 +731,14 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>  	/*
>  	 * In the unlucky event of an allocation failure, we just
>  	 * exit. This will cause the weight of queue to not be
> -	 * considered in bfq_varied_queue_weights_or_active_groups,
> -	 * which, in its turn, causes the scenario to be deemed
> -	 * wrongly symmetric in case bfqq's weight would have been
> -	 * the only weight making the scenario asymmetric. On the
> -	 * bright side, no unbalance will however occur when bfqq
> -	 * becomes inactive again (the invocation of this function
> -	 * is triggered by an activation of queue). In fact,
> -	 * bfq_weights_tree_remove does nothing if
> -	 * !bfqq->weight_counter.
> +	 * considered in bfq_symmetric_scenario, which, in its turn,
> +	 * causes the scenario to be deemed wrongly symmetric in case
> +	 * bfqq's weight would have been the only weight making the
> +	 * scenario asymmetric. On the bright side, no unbalance will
> +	 * however occur when bfqq becomes inactive again (the
> +	 * invocation of this function is triggered by an activation
> +	 * of queue). In fact, bfq_weights_tree_remove does nothing
> +	 * if !bfqq->weight_counter.
>  	 */
>  	if (unlikely(!bfqq->weight_counter))
>  		return;
> @@ -2217,7 +2219,7 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>  		return NULL;
>
>  	/* If there is only one backlogged queue, don't search. */
> -	if (bfqd->busy_queues == 1)
> +	if (bfq_tot_busy_queues(bfqd) == 1)
>  		return NULL;
>
>  	in_service_bfqq = bfqd->in_service_queue;
> @@ -3655,7 +3657,8 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
>  	 * the requests already queued in the device have been served.
>  	 */
>  	asymmetric_scenario = (bfqq->wr_coeff > 1 &&
> -			       bfqd->wr_busy_queues < bfqd->busy_queues) ||
> +			       bfqd->wr_busy_queues <
> +			       bfq_tot_busy_queues(bfqd)) ||
>  		!bfq_symmetric_scenario(bfqd);
>
>  	/*
> @@ -3934,7 +3937,7 @@ static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
>  	 * belongs to CLASS_IDLE and other queues are waiting for
>  	 * service.
>  	 */
> -	if (!(bfqd->busy_queues > 1 && bfq_class_idle(bfqq)))
> +	if (!(bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)))
>  		goto return_rq;
>
>  	bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
> @@ -3952,7 +3955,7 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
>  	 * most a call to dispatch for nothing
>  	 */
>  	return !list_empty_careful(&bfqd->dispatch) ||
> -		bfqd->busy_queues > 0;
> +		bfq_tot_busy_queues(bfqd) > 0;
>  }
>
>  static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
> @@ -4006,9 +4009,10 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  		goto start_rq;
>  	}
>
> -	bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
> +	bfq_log(bfqd, "dispatch requests: %d busy queues",
> +		bfq_tot_busy_queues(bfqd));
>
> -	if (bfqd->busy_queues == 0)
> +	if (bfq_tot_busy_queues(bfqd) == 0)
>  		goto exit;
>
>  	/*
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index 0b02bf302de0..30be669be465 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -501,10 +501,11 @@ struct bfq_data {
>  	unsigned int num_groups_with_pending_reqs;
>
>  	/*
> -	 * Number of bfq_queues containing requests (including the
> -	 * queue in service, even if it is idling).
> +	 * Per-class (RT, BE, IDLE) number of bfq_queues containing
> +	 * requests (including the queue in service, even if it is
> +	 * idling).
>  	 */
> -	int busy_queues;
> +	unsigned int busy_queues[3];
>  	/* number of weight-raised busy @bfq_queues */
>  	int wr_busy_queues;
>  	/* number of queued requests */
> @@ -974,6 +975,7 @@ extern struct blkcg_policy blkcg_policy_bfq;
>
>  struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq);
>  struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
> +unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd);
>  struct bfq_service_tree *bfq_entity_service_tree(struct bfq_entity *entity);
>  struct bfq_entity *bfq_entity_of(struct rb_node *node);
>  unsigned short bfq_ioprio_to_weight(int ioprio);
> diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
> index 63e0f12be7c9..7119207cae3c 100644
> --- a/block/bfq-wf2q.c
> +++ b/block/bfq-wf2q.c
> @@ -44,6 +44,12 @@ static unsigned int bfq_class_idx(struct bfq_entity *entity)
>  		BFQ_DEFAULT_GRP_CLASS - 1;
>  }
>
> +unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd)
> +{
> +	return bfqd->busy_queues[0] + bfqd->busy_queues[1] +
> +		bfqd->busy_queues[2];
> +}
> +
>  static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
>  						 bool expiration);
>
> @@ -1514,7 +1520,7 @@ struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
>  	struct bfq_sched_data *sd;
>  	struct bfq_queue *bfqq;
>
> -	if (bfqd->busy_queues == 0)
> +	if (bfq_tot_busy_queues(bfqd) == 0)
>  		return NULL;
>
>  	/*
> @@ -1666,7 +1672,7 @@ void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>
>  	bfq_clear_bfqq_busy(bfqq);
>
> -	bfqd->busy_queues--;
> +	bfqd->busy_queues[bfqq->ioprio_class - 1]--;
>
>  	if (!bfqq->dispatched)
>  		bfq_weights_tree_remove(bfqd, bfqq);
> @@ -1689,7 +1695,7 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
>  	bfq_activate_bfqq(bfqd, bfqq);
>
>  	bfq_mark_bfqq_busy(bfqq);
> -	bfqd->busy_queues++;
> +	bfqd->busy_queues[bfqq->ioprio_class - 1]++;
>
>  	if (!bfqq->dispatched)
>  		if (bfqq->wr_coeff == 1)
>
> Thanks,
> Paolo
>
>
> > Thanks,
> > Madhav.
> >
> >
> > On Fri, Dec 14, 2018 at 12:20 PM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
> >>
> >>
> >>> Il giorno 14 dic 2018, alle ore 14:55, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> ha scritto:
> >>>
> >>> Hi Paolo,
> >>>
> >>> Thanks a lot for your work and your response to this email.
> >>>
> >>
> >> Thank you for trying bfq!
> >>
> >>> Following your advice, I switched my real-time application to direct I/O.
> >>> We now have
> >>>
> >>> Task1: Using ionice -c1, we run an RT IO / O_DIRECT task that writes
> >>> to flood the NVMe drive to its capacity.
> >>> Task2: We run a normal (best-effort) IO / async (page-cache buffered)
> >>> task that writes to flood the NVMe drive to its capacity.
> >>>
> >>> What we now see is that Task2 still ends up getting about 3/5 or
> >>> more of the NVMe bandwidth and Task1 ends up getting the rest of the
> >>> NVMe disk bandwidth. Could the kernel threads/buffering be
> >>> overpowering the RT priority of Task1?
> >>>
> >>> What we desire is to loosely ensure that Task1 gets as much
> >>> bandwidth as it asks for in any iteration while Task2 and the
> >>> remaining tasks share the leftover bandwidth.
> >>>
> >>> We are currently testing with these settings in BFQ.
> >>> low_latency = 0 (to control the bandwidth allocation)
> >>
> >> This may be a good idea. But, after we sort this out, you could try
> >> leaving low_latency enabled. It should cause no harm (I'm trying to
> >> make bfq smarter and smarter).
> >>
> >>> slice_idle = 0 (we are using a fast NVMe and if Task1 does not
> >>> have any requests and control goes to Task2, it seems to make sense we
> >>> get the control back to Task1 quickly)
> >>
> >> Unfortunately, this is not correct. Setting slice_idle to 0 means
> >> completely losing control over I/O. And in your case you need ...
> >> control :)
> >>
> >> So, you'd better leave slice_idle as it is.
> >>
> >> The actual problem is that, while thinking on whether it was
> >> reasonable to set slice_idle == 0, I realized that bfq is likely to fail
> >> even with slice_idle > 0. In fact, I have not checked whether RT
> >> priority is respected for probably a few years. And I have changed lots of
> >> critical operations in these years.
> >>
> >> So, I'll wait for your feedback, which I do expect to be negative.
> >> Then, if it is actually negative, I'll work on the bug behind the
> >> failure.
> >>
> >>> timeout_sync = 1 (Task1 does application-level buffering and sends
> >>> the biggest chunk of data available (in high MB's) always)
> >>>
> >>
> >> Also for this one, things should go well even without touching it.
> >> But we will find out after discovering whether bfq is broken for your
> >> use case.
> >>
> >> Looking forward to your feedback,
> >> Paolo
> >>
> >>> We are unable to make the leap to cgroups at this time, Paolo. Is
> >>> there anything we can tune in BFQ or change the way we generate
> >>> traffic in Task1 to ensure that Task1 gets the bandwidth it asks for?
> >>>
> >>> A rough approximation to the Task1 and Task2 traffic we discussed
> >>> above seems to be these invocations of dd.
> >>> Task1: ionice -c1 -n2 /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT1
> >>> bs=2M oflag=direct
> >>> Task2: /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT2 bs=2M
> >>>
> >>> Thanks again Paolo,
> >>> Madhav.
> >>>
> >>>
> >>> On Fri, Dec 14, 2018 at 1:38 AM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>>> Il giorno 13 dic 2018, alle ore 21:34, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> ha scritto:
> >>>>>
> >>>>> In our setup, we have a task that writes to an NVMe SSD drive through
> >>>>> the page cache (using ::write OS calls). This task does application-level
> >>>>> buffering and sends big (many-MB) chunks of data to each ::write call.
> >>>>> Each instance of the task writes up to 10 Gbps of data to the NVMe SSD.
> >>>>>
> >>>>> We run two instances of this task as below.
> >>>>> Instance 1: Using ionice -c1, we run an RT IO instance of this task.
> >>>>> Instance 2: We run a normal (best-effort) IO instance of this task.
> >>>>>
> >>>>> Both the write task instances compete for NVMe bandwidth. We observe
> >>>>> that BFQ allocates equal bandwidth to both the task instances starting
> >>>>> a few seconds after they start up.
> >>>>>
> >>>>> What we expected is that Instance 1 (IOPRIO_CLASS_RT scheduling class)
> >>>>> will be granted all the bandwidth it asked for while Instance 2 will be
> >>>>> allowed to consume the remaining bandwidth.
> >>>>>
> >>>>> Could you please help us understand how we might design for the
> >>>>> behavior we expect?
> >>>>>
> >>>>
> >>>> Hi,
> >>>> if you do async, in-memory writes, then your task instances just dirty
> >>>> VM pages. Then different processes, the kworkers, will do the
> >>>> writeback of dirty pages asynchronously, according to the system
> >>>> writeback logic and configuration. kworkers have their own priority,
> >>>> which is likely to be the same for each such process.
> >>>> AFAICT this priority is not related to the priority you give to
> >>>> your processes.
> >>>>
> >>>> If you want to control I/O bandwidths for writes, go for direct I/O or
> >>>> use cgroups. In case of cgroups, consider that there is still the
> >>>> oddity that bfq interface parameters are non-standard. We have
> >>>> proposed and are pushing for a solution to this problem [1].
> >>>>
> >>>> Thanks,
> >>>> Paolo
> >>>>
> >>>> [1] https://lkml.org/lkml/2018/11/19/366
> >>>>
> >>>>> Thanks
> >>>>
> >>
>
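[Editor's note: the sketch below is not part of the thread or of Paolo's
patch. It is a minimal illustration of the setup discussed above for an
application, rather than for ionice plus dd: put the calling task in
IOPRIO_CLASS_RT at level 2 (matching "ionice -c1 -n2") and issue O_DIRECT
writes, so that the I/O priority applies to the requests the task itself
submits instead of to kworker writeback. The output path and write count
are illustrative assumptions taken from the dd commands in the thread;
the IOPRIO_* values are assumed to mirror include/uapi/linux/ioprio.h,
and switching to the RT class typically requires root.]

/*
 * Illustrative sketch only: RT I/O-priority class plus O_DIRECT writes.
 * Path, block size and write count are assumptions, not from the patch.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values assumed to mirror include/uapi/linux/ioprio.h */
#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_PRIO_VALUE(cl, data) (((cl) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_RT         1
#define IOPRIO_WHO_PROCESS      1

int main(void)
{
        /* RT class, level 2, like "ionice -c1 -n2"; usually needs root */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 2)) < 0) {
                perror("ioprio_set");
                return 1;
        }

        /*
         * O_DIRECT bypasses the page cache; buffer, length and file
         * offset must be aligned to the device logical block size
         * (4096 is a safe alignment for most NVMe devices).
         */
        int fd = open("/n/ssd1/ddtestRT", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        size_t len = 2 * 1024 * 1024;   /* 2 MiB, like dd bs=2M */
        void *buf;
        if (posix_memalign(&buf, 4096, len) != 0) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }
        memset(buf, 0, len);

        for (int i = 0; i < 1024; i++) {        /* ~2 GiB of direct writes */
                if (write(fd, buf, len) != (ssize_t)len) {
                        perror("write");
                        break;
                }
        }

        free(buf);
        close(fd);
        return 0;
}

Running this alongside a plain buffered dd approximates the Task1/Task2
experiment in the thread; with the patch applied, the direct-I/O RT
writer should be the task that gets the bandwidth it asks for.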