Hi Jens,
this is based on next, and meant for 4.20, if we are not too late
again (and, of course, if the patch is acceptable).

Thanks,
Paolo

> On 12 Oct 2018, at 11:55, Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
> 
> From: Federico Motta <federico@xxxxxxxxx>
> 
> bfq defines as asymmetric a scenario where an active entity, say E
> (representing either a single bfq_queue or a group of other entities),
> has a higher weight than some other entities. If the entity E does sync
> I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
> other entities in the following situation: E is in service but
> temporarily has no pending I/O request. In fact, without this plugging,
> every time E stops being temporarily idle, it may find the internal
> queues of the storage device already filled with an out-of-control
> number of extra requests, from other entities. So E may have to wait
> for the service of these extra requests, before finally having its own
> requests served. This may easily break service guarantees, with E
> getting less than its fair share of the device throughput. Usually, the
> end result is that E gets the same fraction of the throughput as the
> other entities, instead of getting more, according to its higher
> weight.
> 
> Yet there are two other, more subtle cases where E, even if its weight
> is actually equal to or even lower than the weight of any other active
> entity, may get less than its fair share of the throughput if the above
> I/O plugging is not performed:
> 1. other entities issue larger requests than E;
> 2. other entities contain more active child entities than E (or in
>    general tend to have more backlog than E).
> 
> In the first case, other entities may get more service than E because
> they get larger requests than those of E served during the temporary
> idle periods of E. In the second case, other entities get more service
> because, by having many child entities, they have many requests ready
> for dispatching while E is temporarily idle.
> 
> This commit addresses this issue by extending the definition of
> asymmetric scenario: a scenario is asymmetric when
> - active entities representing bfq_queues have differentiated weights,
>   as in the original definition
> or (inclusive)
> - one or more entities representing groups of entities are active.
> 
> This broader definition makes sure that I/O plugging will be performed
> in all the above cases, provided that there is at least one active
> group. Of course, this definition is very coarse, so it will trigger
> I/O plugging also in cases where it is not needed, such as multiple
> active entities with just one child each, and all with the same
> I/O-request size. The reason for this coarse definition is just that a
> finer-grained definition would be rather heavy to compute.
> 
> On the opposite end, even this new definition does not trigger I/O
> plugging in all cases where there is no active group and all bfq_queues
> have the same weight. So, in these cases some unfairness may occur if
> there are asymmetries in I/O-request sizes. We made this choice because
> I/O plugging may lower throughput, and a user who has not created any
> group probably cares more about throughput than about perfect fairness.
> At any rate, for applications that do care about service guarantees,
> bfq already automatically guarantees high responsiveness and low
> latency to soft real-time applications.
> 
> Signed-off-by: Federico Motta <federico@xxxxxxxxx>
> Signed-off-by: Paolo Valente <paolo.valente@xxxxxxxxxx>
> ---
>  block/bfq-iosched.c | 223 ++++++++++++++++++++++++--------------------
>  block/bfq-iosched.h |  27 +++---
>  block/bfq-wf2q.c    |  36 +++----
>  3 files changed, 155 insertions(+), 131 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 1a1b80dfd69d..6075100f03a5 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -624,12 +624,13 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
>  }
>  
>  /*
> - * Tell whether there are active queues or groups with differentiated weights.
> + * Tell whether there are active queues with different weights or
> + * active groups.
>   */
> -static bool bfq_differentiated_weights(struct bfq_data *bfqd)
> +static bool bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
>  {
>          /*
> -         * For weights to differ, at least one of the trees must contain
> +         * For queue weights to differ, queue_weights_tree must contain
>           * at least two nodes.
>           */
>          return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
> @@ -637,9 +638,7 @@ static bool bfq_differentiated_weights(struct bfq_data *bfqd)
>                   bfqd->queue_weights_tree.rb_node->rb_right)
>  #ifdef CONFIG_BFQ_GROUP_IOSCHED
>                 ) ||
> -               (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
> -                (bfqd->group_weights_tree.rb_node->rb_left ||
> -                 bfqd->group_weights_tree.rb_node->rb_right)
> +               (bfqd->num_active_groups > 0
>  #endif
>                 );
>  }
> @@ -657,26 +656,25 @@ static bool bfq_differentiated_weights(struct bfq_data *bfqd)
>   * 3) all active groups at the same level in the groups tree have the same
>   *    number of children.
>   *
> - * Unfortunately, keeping the necessary state for evaluating exactly the
> - * above symmetry conditions would be quite complex and time-consuming.
> - * Therefore this function evaluates, instead, the following stronger
> - * sub-conditions, for which it is much easier to maintain the needed
> - * state:
> + * Unfortunately, keeping the necessary state for evaluating exactly
> + * the last two symmetry sub-conditions above would be quite complex
> + * and time consuming. Therefore this function evaluates, instead,
> + * only the following stronger two sub-conditions, for which it is
> + * much easier to maintain the needed state:
>   * 1) all active queues have the same weight,
> - * 2) all active groups have the same weight,
> - * 3) all active groups have at most one active child each.
> - * In particular, the last two conditions are always true if hierarchical
> - * support and the cgroups interface are not enabled, thus no state needs
> - * to be maintained in this case.
> + * 2) there are no active groups.
> + * In particular, the last condition is always true if hierarchical
> + * support or the cgroups interface are not enabled, thus no state
> + * needs to be maintained in this case.
>   */
>  static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
>  {
> -        return !bfq_differentiated_weights(bfqd);
> +        return !bfq_varied_queue_weights_or_active_groups(bfqd);
>  }
>  
>  /*
>   * If the weight-counter tree passed as input contains no counter for
> - * the weight of the input entity, then add that counter; otherwise just
> + * the weight of the input queue, then add that counter; otherwise just
>   * increment the existing counter.
>   *
>   * Note that weight-counter trees contain few nodes in mostly symmetric
> @@ -687,25 +685,25 @@ static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
>   * In most scenarios, the rate at which nodes are created/destroyed
>   * should be low too.
>   */
> -void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
> +void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>                            struct rb_root *root)
>  {
> +        struct bfq_entity *entity = &bfqq->entity;
>          struct rb_node **new = &(root->rb_node), *parent = NULL;
>  
>          /*
> -         * Do not insert if the entity is already associated with a
> +         * Do not insert if the queue is already associated with a
>           * counter, which happens if:
> -         * 1) the entity is associated with a queue,
> -         * 2) a request arrival has caused the queue to become both
> +         * 1) a request arrival has caused the queue to become both
>           *    non-weight-raised, and hence change its weight, and
>           *    backlogged; in this respect, each of the two events
>           *    causes an invocation of this function,
> -         * 3) this is the invocation of this function caused by the
> +         * 2) this is the invocation of this function caused by the
>           *    second event. This second invocation is actually useless,
>           *    and we handle this fact by exiting immediately. More
>           *    efficient or clearer solutions might possibly be adopted.
>           */
> -        if (entity->weight_counter)
> +        if (bfqq->weight_counter)
>                  return;
>  
>          while (*new) {
> @@ -715,7 +713,7 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
>                  parent = *new;
>  
>                  if (entity->weight == __counter->weight) {
> -                        entity->weight_counter = __counter;
> +                        bfqq->weight_counter = __counter;
>                          goto inc_counter;
>                  }
>                  if (entity->weight < __counter->weight)
> @@ -724,66 +722,67 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
>                          new = &((*new)->rb_right);
>          }
>  
> -        entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
> -                                         GFP_ATOMIC);
> +        bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
> +                                       GFP_ATOMIC);
>  
>          /*
>           * In the unlucky event of an allocation failure, we just
> -         * exit. This will cause the weight of entity to not be
> -         * considered in bfq_differentiated_weights, which, in its
> -         * turn, causes the scenario to be deemed wrongly symmetric in
> -         * case entity's weight would have been the only weight making
> -         * the scenario asymmetric. On the bright side, no unbalance
> -         * will however occur when entity becomes inactive again (the
> -         * invocation of this function is triggered by an activation
> -         * of entity). In fact, bfq_weights_tree_remove does nothing
> -         * if !entity->weight_counter.
> +         * exit. This will cause the weight of queue to not be
> +         * considered in bfq_varied_queue_weights_or_active_groups,
> +         * which, in its turn, causes the scenario to be deemed
> +         * wrongly symmetric in case bfqq's weight would have been
> +         * the only weight making the scenario asymmetric. On the
> +         * bright side, no unbalance will however occur when bfqq
> +         * becomes inactive again (the invocation of this function
> +         * is triggered by an activation of queue). In fact,
> +         * bfq_weights_tree_remove does nothing if
> +         * !bfqq->weight_counter.
>           */
> -        if (unlikely(!entity->weight_counter))
> +        if (unlikely(!bfqq->weight_counter))
>                  return;
>  
> -        entity->weight_counter->weight = entity->weight;
> -        rb_link_node(&entity->weight_counter->weights_node, parent, new);
> -        rb_insert_color(&entity->weight_counter->weights_node, root);
> +        bfqq->weight_counter->weight = entity->weight;
> +        rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
> +        rb_insert_color(&bfqq->weight_counter->weights_node, root);
>  
>  inc_counter:
> -        entity->weight_counter->num_active++;
> +        bfqq->weight_counter->num_active++;
>  }
>  
>  /*
> - * Decrement the weight counter associated with the entity, and, if the
> + * Decrement the weight counter associated with the queue, and, if the
>   * counter reaches 0, remove the counter from the tree.
>   * See the comments to the function bfq_weights_tree_add() for considerations
>   * about overhead.
>   */
>  void __bfq_weights_tree_remove(struct bfq_data *bfqd,
> -                               struct bfq_entity *entity,
> +                               struct bfq_queue *bfqq,
>                                 struct rb_root *root)
>  {
> -        if (!entity->weight_counter)
> +        if (!bfqq->weight_counter)
>                  return;
>  
> -        entity->weight_counter->num_active--;
> -        if (entity->weight_counter->num_active > 0)
> +        bfqq->weight_counter->num_active--;
> +        if (bfqq->weight_counter->num_active > 0)
>                  goto reset_entity_pointer;
>  
> -        rb_erase(&entity->weight_counter->weights_node, root);
> -        kfree(entity->weight_counter);
> +        rb_erase(&bfqq->weight_counter->weights_node, root);
> +        kfree(bfqq->weight_counter);
>  
>  reset_entity_pointer:
> -        entity->weight_counter = NULL;
> +        bfqq->weight_counter = NULL;
>  }
>  
>  /*
> - * Invoke __bfq_weights_tree_remove on bfqq and all its inactive
> - * parent entities.
> + * Invoke __bfq_weights_tree_remove on bfqq and decrement the number
> + * of active groups for each queue's inactive parent entity.
>   */
>  void bfq_weights_tree_remove(struct bfq_data *bfqd,
>                               struct bfq_queue *bfqq)
>  {
>          struct bfq_entity *entity = bfqq->entity.parent;
>  
> -        __bfq_weights_tree_remove(bfqd, &bfqq->entity,
> +        __bfq_weights_tree_remove(bfqd, bfqq,
>                                    &bfqd->queue_weights_tree);
>  
>          for_each_entity(entity) {
> @@ -797,17 +796,13 @@ void bfq_weights_tree_remove(struct bfq_data *bfqd,
>                           * next_in_service for details on why
>                           * in_service_entity must be checked too).
>                           *
> -                         * As a consequence, the weight of entity is
> -                         * not to be removed. In addition, if entity
> -                         * is active, then its parent entities are
> -                         * active as well, and thus their weights are
> -                         * not to be removed either. In the end, this
> -                         * loop must stop here.
> +                         * As a consequence, its parent entities are
> +                         * active as well, and thus this loop must
> +                         * stop here.
>                           */
>                          break;
>                  }
> -                __bfq_weights_tree_remove(bfqd, entity,
> -                                          &bfqd->group_weights_tree);
> +                bfqd->num_active_groups--;
>          }
>  }
>  
> @@ -3506,9 +3501,11 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
>           * symmetric scenario where:
>           * (i) each of these processes must get the same throughput as
>           *     the others;
> -         * (ii) all these processes have the same I/O pattern
> -                (either sequential or random).
> -         * In fact, in such a scenario, the drive will tend to treat
> +         * (ii) the I/O of each process has the same properties, in
> +         *      terms of locality (sequential or random), direction
> +         *      (reads or writes), request sizes, greediness
> +         *      (from I/O-bound to sporadic), and so on.
> +         * In fact, in such a scenario, the drive tends to treat
>           * the requests of each of these processes in about the same
>           * way as the requests of the others, and thus to provide
>           * each of these processes with about the same throughput
> @@ -3517,18 +3514,50 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
>           * certainly needed to guarantee that bfqq receives its
>           * assigned fraction of the device throughput (see [1] for
>           * details).
> +         * The problem is that idling may significantly reduce
> +         * throughput with certain combinations of types of I/O and
> +         * devices. An important example is sync random I/O, on flash
> +         * storage with command queueing. So, unless bfqq falls in the
> +         * above cases where idling also boosts throughput, it would
> +         * be important to check conditions (i) and (ii) accurately,
> +         * so as to avoid idling when not strictly needed for service
> +         * guarantees.
> +         *
> +         * Unfortunately, it is extremely difficult to thoroughly
> +         * check condition (ii). And, in case there are active groups,
> +         * it becomes very difficult to check condition (i) too. In
> +         * fact, if there are active groups, then, for condition (i)
> +         * to become false, it is enough that an active group contains
> +         * more active processes or sub-groups than some other active
> +         * group. We address this issue with the following bi-modal
> +         * behavior, implemented in the function
> +         * bfq_symmetric_scenario().
>           *
> -         * We address this issue by controlling, actually, only the
> -         * symmetry sub-condition (i), i.e., provided that
> -         * sub-condition (i) holds, idling is not performed,
> -         * regardless of whether sub-condition (ii) holds. In other
> -         * words, only if sub-condition (i) holds, then idling is
> +         * If there are active groups, then the scenario is tagged as
> +         * asymmetric, conservatively, without checking any of the
> +         * conditions (i) and (ii). So the device is idled for bfqq.
> +         * This behavior matches also the fact that groups are created
> +         * exactly if controlling I/O (to preserve bandwidth and
> +         * latency guarantees) is a primary concern.
> +         *
> +         * On the opposite end, if there are no active groups, then
> +         * only condition (i) is actually controlled, i.e., provided
> +         * that condition (i) holds, idling is not performed,
> +         * regardless of whether condition (ii) holds. In other words,
> +         * only if condition (i) does not hold, then idling is
>           * allowed, and the device tends to be prevented from queueing
> -         * many requests, possibly of several processes. The reason
> -         * for not controlling also sub-condition (ii) is that we
> -         * exploit preemption to preserve guarantees in case of
> -         * symmetric scenarios, even if (ii) does not hold, as
> -         * explained in the next two paragraphs.
> +         * many requests, possibly of several processes. Since there
> +         * are no active groups, then, to control condition (i) it is
> +         * enough to check whether all active queues have the same
> +         * weight.
> +         *
> +         * Not checking condition (ii) evidently exposes bfqq to the
> +         * risk of getting less throughput than its fair share.
> +         * However, for queues with the same weight, a further
> +         * mechanism, preemption, mitigates or even eliminates this
> +         * problem. And it does so without consequences on overall
> +         * throughput. This mechanism and its benefits are explained
> +         * in the next three paragraphs.
>           *
>           * Even if a queue, say Q, is expired when it remains idle, Q
>           * can still preempt the new in-service queue if the next
> @@ -3542,11 +3571,7 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
>           * idling allows the internal queues of the device to contain
>           * many requests, and thus to reorder requests, we can rather
>           * safely assume that the internal scheduler still preserves a
> -         * minimum of mid-term fairness. The motivation for using
> -         * preemption instead of idling is that, by not idling,
> -         * service guarantees are preserved without minimally
> -         * sacrificing throughput. In other words, both a high
> -         * throughput and its desired distribution are obtained.
> +         * minimum of mid-term fairness.
>           *
>           * More precisely, this preemption-based, idleless approach
>           * provides fairness in terms of IOPS, and not sectors per
> @@ -3565,27 +3590,27 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
>           * 1024/8 times as high as the service received by the other
>           * queue.
>           *
> -         * On the other hand, device idling is performed, and thus
> -         * pure sector-domain guarantees are provided, for the
> -         * following queues, which are likely to need stronger
> -         * throughput guarantees: weight-raised queues, and queues
> -         * with a higher weight than other queues. When such queues
> -         * are active, sub-condition (i) is false, which triggers
> -         * device idling.
> +         * The motivation for using preemption instead of idling (for
> +         * queues with the same weight) is that, by not idling,
> +         * service guarantees are preserved (completely or at least in
> +         * part) without minimally sacrificing throughput. And, if
> +         * there is no active group, then the primary expectation for
> +         * this device is probably a high throughput.
>           *
> -         * According to the above considerations, the next variable is
> -         * true (only) if sub-condition (i) holds. To compute the
> -         * value of this variable, we not only use the return value of
> -         * the function bfq_symmetric_scenario(), but also check
> -         * whether bfqq is being weight-raised, because
> -         * bfq_symmetric_scenario() does not take into account also
> -         * weight-raised queues (see comments on
> -         * bfq_weights_tree_add()). In particular, if bfqq is being
> -         * weight-raised, it is important to idle only if there are
> -         * other, non-weight-raised queues that may steal throughput
> -         * to bfqq. Actually, we should be even more precise, and
> -         * differentiate between interactive weight raising and
> -         * soft real-time weight raising.
> +         * We are now left only with explaining the additional
> +         * compound condition that is checked below for deciding
> +         * whether the scenario is asymmetric. To explain this
> +         * compound condition, we need to add that the function
> +         * bfq_symmetric_scenario checks the weights of only
> +         * non-weight-raised queues, for efficiency reasons (see
> +         * comments on bfq_weights_tree_add()). Then the fact that
> +         * bfqq is weight-raised is checked explicitly here. More
> +         * precisely, the compound condition below takes into account
> +         * also the fact that, even if bfqq is being weight-raised,
> +         * the scenario is still symmetric if all active queues happen
> +         * to be weight-raised. Actually, we should be even more
> +         * precise here, and differentiate between interactive weight
> +         * raising and soft real-time weight raising.
>           *
>           * As a side note, it is worth considering that the above
>           * device-idling countermeasures may however fail in the
> @@ -5392,7 +5417,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
>          bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
>  
>          bfqd->queue_weights_tree = RB_ROOT;
> -        bfqd->group_weights_tree = RB_ROOT;
> +        bfqd->num_active_groups = 0;
>  
>          INIT_LIST_HEAD(&bfqd->active_list);
>          INIT_LIST_HEAD(&bfqd->idle_list);
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index 37d627afdc2e..77651d817ecd 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -108,15 +108,14 @@ struct bfq_sched_data {
>  };
>  
>  /**
> - * struct bfq_weight_counter - counter of the number of all active entities
> + * struct bfq_weight_counter - counter of the number of all active queues
>   * with a given weight.
>   */
>  struct bfq_weight_counter {
> -        unsigned int weight; /* weight of the entities this counter refers to */
> -        unsigned int num_active; /* nr of active entities with this weight */
> +        unsigned int weight; /* weight of the queues this counter refers to */
> +        unsigned int num_active; /* nr of active queues with this weight */
>          /*
> -         * Weights tree member (see bfq_data's @queue_weights_tree and
> -         * @group_weights_tree)
> +         * Weights tree member (see bfq_data's @queue_weights_tree)
>           */
>          struct rb_node weights_node;
>  };
> @@ -151,8 +150,6 @@ struct bfq_weight_counter {
>  struct bfq_entity {
>          /* service_tree member */
>          struct rb_node rb_node;
> -        /* pointer to the weight counter associated with this entity */
> -        struct bfq_weight_counter *weight_counter;
>  
>          /*
>           * Flag, true if the entity is on a tree (either the active or
> @@ -266,6 +263,9 @@ struct bfq_queue {
>          /* entity representing this queue in the scheduler */
>          struct bfq_entity entity;
>  
> +        /* pointer to the weight counter associated with this entity */
> +        struct bfq_weight_counter *weight_counter;
> +
>          /* maximum budget allowed from the feedback mechanism */
>          int max_budget;
>          /* budget expiration (in jiffies) */
> @@ -449,14 +449,9 @@ struct bfq_data {
>           */
>          struct rb_root queue_weights_tree;
>          /*
> -         * rbtree of non-queue @bfq_entity weight counters, sorted by
> -         * weight. Used to keep track of whether all @bfq_groups have
> -         * the same weight. The tree contains one counter for each
> -         * distinct weight associated to some active @bfq_group (see
> -         * the comments to the functions bfq_weights_tree_[add|remove]
> -         * for further details).
> +         * number of groups with requests still waiting for completion
>           */
> -        struct rb_root group_weights_tree;
> +        unsigned int num_active_groups;
>  
>          /*
>           * Number of bfq_queues containing requests (including the
> @@ -851,10 +846,10 @@ struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync);
>  void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
>  struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
>  void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
> -void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
> +void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>                            struct rb_root *root);
>  void __bfq_weights_tree_remove(struct bfq_data *bfqd,
> -                               struct bfq_entity *entity,
> +                               struct bfq_queue *bfqq,
>                                 struct rb_root *root);
>  void bfq_weights_tree_remove(struct bfq_data *bfqd,
>                               struct bfq_queue *bfqq);
> diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
> index ff7c2d470bb8..476b5a90a5a4 100644
> --- a/block/bfq-wf2q.c
> +++ b/block/bfq-wf2q.c
> @@ -788,25 +788,29 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
>                  new_weight = entity->orig_weight *
>                               (bfqq ? bfqq->wr_coeff : 1);
>                  /*
> -                 * If the weight of the entity changes, remove the entity
> -                 * from its old weight counter (if there is a counter
> -                 * associated with the entity), and add it to the counter
> -                 * associated with its new weight.
> +                 * If the weight of the entity changes, and the entity is a
> +                 * queue, remove the entity from its old weight counter (if
> +                 * there is a counter associated with the entity).
>                   */
>                  if (prev_weight != new_weight) {
> -                        root = bfqq ? &bfqd->queue_weights_tree :
> -                                      &bfqd->group_weights_tree;
> -                        __bfq_weights_tree_remove(bfqd, entity, root);
> +                        if (bfqq) {
> +                                root = &bfqd->queue_weights_tree;
> +                                __bfq_weights_tree_remove(bfqd, bfqq, root);
> +                        } else
> +                                bfqd->num_active_groups--;
>                  }
>                  entity->weight = new_weight;
>                  /*
> -                 * Add the entity to its weights tree only if it is
> -                 * not associated with a weight-raised queue.
> +                 * Add the entity, if it is not a weight-raised queue,
> +                 * to the counter associated with its new weight.
>                   */
> -                if (prev_weight != new_weight &&
> -                    (bfqq ? bfqq->wr_coeff == 1 : 1))
> -                        /* If we get here, root has been initialized. */
> -                        bfq_weights_tree_add(bfqd, entity, root);
> +                if (prev_weight != new_weight) {
> +                        if (bfqq && bfqq->wr_coeff == 1) {
> +                                /* If we get here, root has been initialized. */
> +                                bfq_weights_tree_add(bfqd, bfqq, root);
> +                        } else
> +                                bfqd->num_active_groups++;
> +                }
>  
>                  new_st->wsum += entity->weight;
>  
> @@ -1012,9 +1016,9 @@ static void __bfq_activate_entity(struct bfq_entity *entity,
>          if (!bfq_entity_to_bfqq(entity)) { /* bfq_group */
>                  struct bfq_group *bfqg =
>                          container_of(entity, struct bfq_group, entity);
> +                struct bfq_data *bfqd = bfqg->bfqd;
>  
> -                bfq_weights_tree_add(bfqg->bfqd, entity,
> -                                     &bfqd->group_weights_tree);
> +                bfqd->num_active_groups++;
>          }
>  #endif
>  
> @@ -1692,7 +1696,7 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
>  
>          if (!bfqq->dispatched)
>                  if (bfqq->wr_coeff == 1)
> -                        bfq_weights_tree_add(bfqd, &bfqq->entity,
> +                        bfq_weights_tree_add(bfqd, bfqq,
>                                               &bfqd->queue_weights_tree);
>  
>          if (bfqq->wr_coeff > 1)
> -- 
> 2.19.1
> 