Re: Unintuitive scheduling results using BFQ

Hi Paolo,

    Thanks a lot.
    The RT behavior you described is exactly how we would expect our
RT priority task to behave.
    We are on 4.14. Will test and write back soon.

Thanks,
-Madhav.


On Tue, Dec 18, 2018 at 1:04 PM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>
>
>
> > On 14 Dec 2018, at 19:25, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > Hi Paolo,
> >
> >    Thanks a lot for your time and help.
> >
> >    I reset all the parameters for bfq and am now running with this
> > one change from the defaults.
> >    - low_latency is set to 0.
> >
> >    When tested with our applications, the IOPRIO_CLASS_RT task (doing
> > direct I/O) gets no preference over the IOPRIO_CLASS_BE task (which
> > was using the kernel page cache).
> >
> >    When tested with dd, the IOPRIO_CLASS_RT task (with direct I/O)
> > was getting about 1/4th of the bandwidth that the IOPRIO_CLASS_BE
> > task (with buffered I/O) was getting.
> >        - RT task with direct I/O: ionice -c1 -n2 /bin/dd if=/dev/zero
> > of=/n/ssd1/ddtestRT bs=2M oflag=direct
> >        - BE task with buffered I/O: /bin/dd if=/dev/zero
> > of=/n/ssd1/ddtestRT bs=2M
> >
> >    Please let me know if you want me to test more Paolo.
> >    Will look forward to hearing from you.
> >
>
> Ok, doubt confirmed: bfq didn't properly check whether processes of
> different I/O-priority classes were doing I/O.  I've attached a
> tentative fix as a compressed file (to prevent my mail client from
> corrupting it).  You can also find the same diffs pasted below, if you
> want to have a look and comment.
>
> The patch should apply cleanly on top of current master.
>
> Let me know if it works.
>
> In particular, consider the following two facts about your RT process
> (i.e., your process with RT I/O-priority class):
> 1) When served in parallel with other I/O, your RT process still
> cannot reach a higher throughput than it reaches when served alone.
> I'm making this silly remark because you may otherwise be confused by
> the fact that modern, parallel drives usually reach higher and higher
> throughput as the number of processes doing I/O grows.
> 2) If your RT process does I/O greedily, then it will monopolize the
> drive after this patch (if this patch works :) ).  This implies that,
> even if you have many processes doing I/O in parallel with your RT
> process, your drive will serve almost only your RT process while the
> latter is active.  Therefore parallelism may be reduced, and the total
> throughput of the drive may be lower than that reached without I/O
> control (I'm working on reducing this loss too, but it is really a
> non-trivial task).
>
> Here are the changes.
>
>  block/bfq-iosched.c | 86 ++++++++++++++++++++++++++++-------------------------
>  block/bfq-iosched.h |  8 +++--
>  block/bfq-wf2q.c    | 12 ++++++--
>  3 files changed, 59 insertions(+), 47 deletions(-)
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 97337214bec4..d8d7a32ce64b 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -623,26 +623,6 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
>                 bfqq->pos_root = NULL;
>  }
>
> -/*
> - * Tell whether there are active queues with different weights or
> - * active groups.
> - */
> -static bool bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
> -{
> -       /*
> -        * For queue weights to differ, queue_weights_tree must contain
> -        * at least two nodes.
> -        */
> -       return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
> -               (bfqd->queue_weights_tree.rb_node->rb_left ||
> -                bfqd->queue_weights_tree.rb_node->rb_right)
> -#ifdef CONFIG_BFQ_GROUP_IOSCHED
> -              ) ||
> -               (bfqd->num_groups_with_pending_reqs > 0
> -#endif
> -              );
> -}
> -
>  /*
>   * The following function returns true if every queue must receive the
>   * same share of the throughput (this condition is used when deciding
> @@ -651,25 +631,48 @@ static bool bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
>   *
>   * Such a scenario occurs when:
>   * 1) all active queues have the same weight,
> - * 2) all active groups at the same level in the groups tree have the same
> - *    weight,
> + * 2) all active queues belong to the same I/O-priority class,
>   * 3) all active groups at the same level in the groups tree have the same
> + *    weight,
> + * 4) all active groups at the same level in the groups tree have the same
>   *    number of children.
>   *
>   * Unfortunately, keeping the necessary state for evaluating exactly
>   * the last two symmetry sub-conditions above would be quite complex
> - * and time consuming.  Therefore this function evaluates, instead,
> - * only the following stronger two sub-conditions, for which it is
> + * and time consuming. Therefore this function evaluates, instead,
> + * only the following stronger three sub-conditions, for which it is
>   * much easier to maintain the needed state:
>   * 1) all active queues have the same weight,
> - * 2) there are no active groups.
> + * 2) all active queues belong to the same I/O-priority class,
> + * 3) there are no active groups.
>   * In particular, the last condition is always true if hierarchical
>   * support or the cgroups interface are not enabled, thus no state
>   * needs to be maintained in this case.
>   */
>  static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
>  {
> -       return !bfq_varied_queue_weights_or_active_groups(bfqd);
> +       /*
> +        * For queue weights to differ, queue_weights_tree must contain
> +        * at least two nodes.
> +        */
> +       bool varied_queue_weights = !RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
> +               (bfqd->queue_weights_tree.rb_node->rb_left ||
> +                bfqd->queue_weights_tree.rb_node->rb_right);
> +
> +       bool multiple_classes_busy =
> +               (bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
> +               (bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
> +               (bfqd->busy_queues[1] && bfqd->busy_queues[2]);
> +
> +       /*
> +        * The scenario is symmetric only if none of the above
> +        * asymmetry conditions holds.
> +        */
> +       return !(varied_queue_weights || multiple_classes_busy
> +#ifdef CONFIG_BFQ_GROUP_IOSCHED
> +              || bfqd->num_groups_with_pending_reqs > 0
> +#endif
> +               );
>  }
>
>  /*
> @@ -728,15 +731,14 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>         /*
>          * In the unlucky event of an allocation failure, we just
>          * exit. This will cause the weight of queue to not be
> -        * considered in bfq_varied_queue_weights_or_active_groups,
> -        * which, in its turn, causes the scenario to be deemed
> -        * wrongly symmetric in case bfqq's weight would have been
> -        * the only weight making the scenario asymmetric.  On the
> -        * bright side, no unbalance will however occur when bfqq
> -        * becomes inactive again (the invocation of this function
> -        * is triggered by an activation of queue).  In fact,
> -        * bfq_weights_tree_remove does nothing if
> -        * !bfqq->weight_counter.
> +        * considered in bfq_symmetric_scenario, which, in its turn,
> +        * causes the scenario to be deemed wrongly symmetric in case
> +        * bfqq's weight would have been the only weight making the
> +        * scenario asymmetric.  On the bright side, no unbalance will
> +        * however occur when bfqq becomes inactive again (the
> +        * invocation of this function is triggered by an activation
> +        * of queue).  In fact, bfq_weights_tree_remove does nothing
> +        * if !bfqq->weight_counter.
>          */
>         if (unlikely(!bfqq->weight_counter))
>                 return;
> @@ -2217,7 +2219,7 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>                 return NULL;
>
>         /* If there is only one backlogged queue, don't search. */
> -       if (bfqd->busy_queues == 1)
> +       if (bfq_tot_busy_queues(bfqd) == 1)
>                 return NULL;
>
>         in_service_bfqq = bfqd->in_service_queue;
> @@ -3655,7 +3657,8 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
>          * the requests already queued in the device have been served.
>          */
>         asymmetric_scenario = (bfqq->wr_coeff > 1 &&
> -                              bfqd->wr_busy_queues < bfqd->busy_queues) ||
> +                              bfqd->wr_busy_queues <
> +                              bfq_tot_busy_queues(bfqd)) ||
>                 !bfq_symmetric_scenario(bfqd);
>
>         /*
> @@ -3934,7 +3937,7 @@ static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
>          * belongs to CLASS_IDLE and other queues are waiting for
>          * service.
>          */
> -       if (!(bfqd->busy_queues > 1 && bfq_class_idle(bfqq)))
> +       if (!(bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)))
>                 goto return_rq;
>
>         bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
> @@ -3952,7 +3955,7 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
>          * most a call to dispatch for nothing
>          */
>         return !list_empty_careful(&bfqd->dispatch) ||
> -               bfqd->busy_queues > 0;
> +               bfq_tot_busy_queues(bfqd) > 0;
>  }
>
>  static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
> @@ -4006,9 +4009,10 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
>                 goto start_rq;
>         }
>
> -       bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
> +       bfq_log(bfqd, "dispatch requests: %d busy queues",
> +               bfq_tot_busy_queues(bfqd));
>
> -       if (bfqd->busy_queues == 0)
> +       if (bfq_tot_busy_queues(bfqd) == 0)
>                 goto exit;
>
>         /*
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index 0b02bf302de0..30be669be465 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -501,10 +501,11 @@ struct bfq_data {
>         unsigned int num_groups_with_pending_reqs;
>
>         /*
> -        * Number of bfq_queues containing requests (including the
> -        * queue in service, even if it is idling).
> +        * Per-class (RT, BE, IDLE) number of bfq_queues containing
> +        * requests (including the queue in service, even if it is
> +        * idling).
>          */
> -       int busy_queues;
> +       unsigned int busy_queues[3];
>         /* number of weight-raised busy @bfq_queues */
>         int wr_busy_queues;
>         /* number of queued requests */
> @@ -974,6 +975,7 @@ extern struct blkcg_policy blkcg_policy_bfq;
>
>  struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq);
>  struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
> +unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd);
>  struct bfq_service_tree *bfq_entity_service_tree(struct bfq_entity *entity);
>  struct bfq_entity *bfq_entity_of(struct rb_node *node);
>  unsigned short bfq_ioprio_to_weight(int ioprio);
> diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
> index 63e0f12be7c9..7119207cae3c 100644
> --- a/block/bfq-wf2q.c
> +++ b/block/bfq-wf2q.c
> @@ -44,6 +44,12 @@ static unsigned int bfq_class_idx(struct bfq_entity *entity)
>                 BFQ_DEFAULT_GRP_CLASS - 1;
>  }
>
> +unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd)
> +{
> +       return bfqd->busy_queues[0] + bfqd->busy_queues[1] +
> +               bfqd->busy_queues[2];
> +}
> +
>  static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
>                                                  bool expiration);
>
> @@ -1514,7 +1520,7 @@ struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
>         struct bfq_sched_data *sd;
>         struct bfq_queue *bfqq;
>
> -       if (bfqd->busy_queues == 0)
> +       if (bfq_tot_busy_queues(bfqd) == 0)
>                 return NULL;
>
>         /*
> @@ -1666,7 +1672,7 @@ void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>
>         bfq_clear_bfqq_busy(bfqq);
>
> -       bfqd->busy_queues--;
> +       bfqd->busy_queues[bfqq->ioprio_class - 1]--;
>
>         if (!bfqq->dispatched)
>                 bfq_weights_tree_remove(bfqd, bfqq);
> @@ -1689,7 +1695,7 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
>         bfq_activate_bfqq(bfqd, bfqq);
>
>         bfq_mark_bfqq_busy(bfqq);
> -       bfqd->busy_queues++;
> +       bfqd->busy_queues[bfqq->ioprio_class - 1]++;
>
>         if (!bfqq->dispatched)
>                 if (bfqq->wr_coeff == 1)
>
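> Just to illustrate the idea behind the diffs above: the per-class
> counters and the "multiple classes busy" test boil down to something
> like the following stand-alone sketch (user-space C, written only for
> illustration; names and the RT/BE/IDLE index mapping mirror the patch
> but this is not the kernel code itself):
>
> #include <stdbool.h>
> #include <stdio.h>
>
> /* Per-class busy-queue counters, mirroring bfqd->busy_queues[3]
>  * (index 0 = RT, 1 = BE, 2 = IDLE after the ioprio_class - 1 mapping). */
> static unsigned int busy_queues[3];
>
> static unsigned int tot_busy_queues(void)
> {
>         return busy_queues[0] + busy_queues[1] + busy_queues[2];
> }
>
> /*
>  * True when more than one I/O-priority class has busy queues, i.e.,
>  * the scenario must be treated as asymmetric.  Note the logical
>  * tests: with counters, a bitwise AND such as (2 & 1) would wrongly
>  * evaluate to 0.
>  */
> static bool multiple_classes_busy(void)
> {
>         return (busy_queues[0] && busy_queues[1]) ||
>                (busy_queues[0] && busy_queues[2]) ||
>                (busy_queues[1] && busy_queues[2]);
> }
>
> int main(void)
> {
>         busy_queues[0] = 2;     /* two busy RT queues */
>         busy_queues[1] = 1;     /* one busy BE queue  */
>
>         printf("total busy: %u, multiple classes busy: %d\n",
>                tot_busy_queues(), multiple_classes_busy());
>         return 0;
> }
>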
> Thanks,
> Paolo
>
>
>
> > Thanks,
> > Madhav.
> >
> >
> > On Fri, Dec 14, 2018 at 12:20 PM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >>> On 14 Dec 2018, at 14:55, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> wrote:
> >>>
> >>> Hi Paolo,
> >>>
> >>>   Thanks a lot for your work and your response to this email.
> >>>
> >>
> >> Thank you for trying bfq!
> >>
> >>>   Following your advice, I switched my real-time application to direct I/O.
> >>>   We now have
> >>>
> >>>   Task1: Using ionice -c1, we run an RT-I/O, O_DIRECT task that
> >>> writes to flood the NVMe drive to its capacity.
> >>>   Task2: We run a normal (best-effort) I/O, async (page-cache
> >>> buffered) task that writes to flood the NVMe drive to its capacity.
> >>>
> >>>   What we now see is that Task2 still ends up getting about 3/5ths
> >>> or more of the NVMe bandwidth, and Task1 ends up getting the rest of
> >>> the NVMe disk bandwidth. Could the kernel threads/buffering be
> >>> overpowering the RT priority of Task1?
> >>>
> >>>   What we desire is to loosely ensure that Task1 gets as much
> >>> bandwidth as it asks for in any iteration while Task2 and the
> >>> remaining tasks share the leftover bandwidth.
> >>>
> >>>   We are currently testing with these settings in BFQ.
> >>>   low_latency = 0 (to control the bandwidth allocation)
> >>
> >> This may be a good idea.  But, after we sort this out, you could try
> >> leaving low_latency enabled.  It should cause no harm (I'm trying to
> >> make bfq smarter and smarter).
> >>
> >>
> >>>   slice_idle = 0     (we are using a fast NVMe, and if Task1 does not
> >>> have any requests and control goes to Task2, it seems to make sense to
> >>> get control back to Task1 quickly)
> >>
> >> Unfortunately, this is not correct.  Setting slice_idle to 0 means
> >> completely losing control of I/O.  And in your case you need ...
> >> control :)
> >>
> >> So, you'd better leave slice_idle as it is.
> >>
> >> The actual problem is that, while thinking about whether it was
> >> reasonable to set slice_idle == 0, I realized that bfq is likely to fail
> >> even with slice_idle > 0.  In fact, I have not checked that RT priority
> >> is respected for probably a few years.  And I have changed lots of
> >> critical operations in these years.
> >>
> >> So, I'll wait for your feedback, which I do expect to be negative.
> >> Then, if it is actually negative, I'll work on the bug behind the
> >> failure.
> >>
> >>>   timeout_sync = 1 (Task1 does application-level buffering and always
> >>> sends the biggest chunk of data available, in the high MBs)
> >>>
> >>
> >> Also for this one, things should go well even without touching it.
> >> But we will find out after discovering whether bfq is broken for your
> >> use case.
> >>
> >> Looking forward to your feedback,
> >> Paolo
> >>
> >>>   We are unable to make the leap to cgroups at this time, Paolo. Is
> >>> there anything we can tune in BFQ, or change in the way we generate
> >>> traffic in Task1, to ensure that Task1 gets the bandwidth it asks for?
> >>>
> >>>   A rough approximation of the Task1 and Task2 traffic we discussed
> >>> above seems to be these invocations of dd:
> >>>   Task1: ionice -c1 -n2 /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT1
> >>> bs=2M oflag=direct
> >>>   Task2: /bin/dd if=/dev/zero of=/n/ssd1/ddtestRT2   bs=2M
> >>>
> >>> Thanks again Paolo,
> >>> Madhav.
> >>>
> >>>
> >>>
> >>> On Fri, Dec 14, 2018 at 1:38 AM Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On 13 Dec 2018, at 21:34, Madhav Ancha <mancha@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> In our setup, we have a task that writes to an NVMe SSD drive using
> >>>>> the page cache (using ::write OS calls). This task does
> >>>>> application-level buffering and sends big (many-MB) chunks of data to
> >>>>> each ::write call. Each instance of the task writes up to 10Gbps of
> >>>>> data to the NVMe SSD.
> >>>>>
> >>>>> We run two instances of this task as below.
> >>>>> Instance 1: Using ionice -c1, we run an RT-I/O instance of this task.
> >>>>> Instance 2: We run a normal (best-effort) IO instance of this task.
> >>>>>
> >>>>> Both the write task instances compete for NVMe bandwidth. We observe
> >>>>> that BFQ allocates equal bandwidth to both the task instances starting
> >>>>> a few seconds after they start up.
> >>>>>
> >>>>> What we expected is that Instance1 (IOPRIO_CLASS_RT scheduling class)
> >>>>> will be granted all the bandwidth it asked for while Instance2 will be
> >>>>> allowed to consume the remaining bandwidth.
> >>>>>
> >>>>> Could you please help us understand how we might design our setup to
> >>>>> get the behavior we expect?
> >>>>>
> >>>>
> >>>> Hi,
> >>>> if you do async, in-memory writes, then your task instances just dirty
> >>>> vm pages.  Then different processes, the kworkers, will do the
> >>>> writeback of dirty pages asynchronously, according to the system
> >>>> writeback logic and configuration.  kworkers have their own priority,
> >>>> which is likely to be the same for each such process.  AFAICT this
> >>>> priority is not related to the priority you give to your processes.
> >>>>
> >>>> If you want to control I/O bandwidths for writes, go for direct I/O or
> >>>> use cgroups.  In the case of cgroups, consider that there is still the
> >>>> oddity that the bfq interface parameters are non-standard.  We have
> >>>> proposed and are pushing for a solution to this problem [1].
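> >>>>
> >>>> In case it helps, going for direct I/O with the RT class boils down,
> >>>> in user space, to something like the sketch below (the path, chunk
> >>>> size and loop count are just placeholders, not taken from your setup):
> >>>>
> >>>> #define _GNU_SOURCE
> >>>> #include <fcntl.h>
> >>>> #include <stdio.h>
> >>>> #include <stdlib.h>
> >>>> #include <string.h>
> >>>> #include <sys/syscall.h>
> >>>> #include <unistd.h>
> >>>>
> >>>> /* No glibc wrapper exists for ioprio_set(), so define the bits here. */
> >>>> #define IOPRIO_CLASS_SHIFT      13
> >>>> #define IOPRIO_PRIO_VALUE(cl, data)  (((cl) << IOPRIO_CLASS_SHIFT) | (data))
> >>>> #define IOPRIO_CLASS_RT         1
> >>>> #define IOPRIO_WHO_PROCESS      1
> >>>>
> >>>> int main(void)
> >>>> {
> >>>>         const char *path = "/tmp/rt-writer-test"; /* placeholder path */
> >>>>         size_t chunk = 2 * 1024 * 1024;           /* 2M, as in the dd tests */
> >>>>         void *buf;
> >>>>         int fd, i;
> >>>>
> >>>>         /* Give this process the RT I/O class, level 2 (like ionice -c1 -n2). */
> >>>>         if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
> >>>>                     IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 2)) < 0) {
> >>>>                 perror("ioprio_set");
> >>>>                 return 1;
> >>>>         }
> >>>>
> >>>>         /* O_DIRECT requires aligned buffers; 4096 covers typical NVMe LBAs. */
> >>>>         if (posix_memalign(&buf, 4096, chunk))
> >>>>                 return 1;
> >>>>         memset(buf, 0, chunk);
> >>>>
> >>>>         fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
> >>>>         if (fd < 0) {
> >>>>                 perror("open");
> >>>>                 return 1;
> >>>>         }
> >>>>
> >>>>         /*
> >>>>          * With O_DIRECT the writes bypass the page cache, so the I/O
> >>>>          * is issued in this process's context and carries its RT
> >>>>          * class, instead of being done later by kworkers.
> >>>>          */
> >>>>         for (i = 0; i < 1000; i++) {
> >>>>                 if (write(fd, buf, chunk) < 0) {
> >>>>                         perror("write");
> >>>>                         break;
> >>>>                 }
> >>>>         }
> >>>>
> >>>>         close(fd);
> >>>>         free(buf);
> >>>>         return 0;
> >>>> }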
> >>>>
> >>>> Thanks,
> >>>> Paolo
> >>>>
> >>>> [1] https://lkml.org/lkml/2018/11/19/366
> >>>>
> >>>>> Thanks
> >>>>
> >>
>


