Hello, On Tue 02-11-21 19:16:58, Michal Koutný wrote: > On Wed, Oct 06, 2021 at 07:31:43PM +0200, Jan Kara <jack@xxxxxxx> wrote: > > + for (level--; level >= 0; level--) { > > + entity = entities[level]; > > + if (level > 0) { > > + wsum = bfq_entity_service_tree(entity)->wsum; > > + } else { > > + int i; > > + /* > > + * For bfqq itself we take into account service trees > > + * of all higher priority classes and multiply their > > + * weights so that low prio queue from higher class > > + * gets more requests than high prio queue from lower > > + * class. > > + */ > > + wsum = 0; > > + for (i = 0; i <= class_idx; i++) { > > + wsum = wsum * IOPRIO_BE_NR + > > + sched_data->service_tree[i].wsum; > > + } > > + } > > + limit = DIV_ROUND_CLOSEST(limit * entity->weight, wsum); > > This scheme caught my eye. You mutliply (tree) weights by a factor > depending on the class when counting the wsum but then you don't apply > the same scaling for the evaluated entity in the numerator. Since we stop the loop at bfq_class_idx(entity) I don't think scaling of the numerator makes sense - effectively when all the processes having IO to submit (these are accounted in wsum) are in the same IO priority class, the above code collapses to just: limit = limit * entity->weight / sched_data->service_tree[bfq_class_idx(entity)].wsum I.e., we scale available tags proportionally to bfq_queue weight (which scales linearly with IO priority). When there are processes say both in RT and BE IO priority classes, then a process in RT class still uses the above formula - i.e., as if all tags available for a cgroup are split proportionally only among active tasks in RT IO priority class. So in principle it can happen that there would be no tag left for a process in lower IO priority class - and that is fine, we don't care, because we don't want to submit IO from lower IO priority class while there is still IO in higher IO priority class. Now consider a situation for a process in BE IO priority class in this setting. All processes in BE class can together occupy at most BE_wsum / (RT_wsum * IOPRIO_BE_NR + BE_wsum) fraction of tags. This is admittedly somewhat arbitrary fraction but it makes sure for each process in RT class there are at least as many tags left as for the highest priority process in BE class. > IOW, I think there should be something like > scale = (level > 0) ? 1 : int_pow(IOPRIO_BE_NR, BFQ_IOPRIO_CLASSES - bfq_class_idx(entity)); > limit = DIV_ROUND_CLOSEST(limit * entity->weight * scale, wsum); > > For instance, if there are two cgroups (level=1) with weights 100 and > 200, and each cgroup has a single IOPRIO_CLASS_BE entity (level=0) in > it, the `limit` distribution would honor the ratio of weights from > level=1 (100:200) but it would artificially lower the absolute amount of > allowed tags. If I am not mistaken, that would be reduced by factor > 1/BFQ_IOPRIO_CLASSES. I don't see where my code would lead to available number of tags being artifically lower as you write - in your example wsum for RT class would be 0 so effectively all terms of the formula for that class will cancel out. As I wrote above, the highest active IO priority class effectively allows processes in this class to consume all tags available for a cgroup. If there are lower IO priority classes active as well, we allow them to consume some tags but never allow them to consume all of them... > Also if I consider it more broadly, is this supposed to match/extend > bfq_io_prio_to_weight() calculation? Yes, this is kind of an extension of bfq_io_prio_to_weight() that allows some comparison of queues from different IO priority classes. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR