> From: Caleb Sander <csander@xxxxxxxxxxxxxxx>
> Sent: Tuesday, October 29, 2024 10:02 PM
>
> On Mon, Oct 28, 2024 at 9:08 PM Parav Pandit <parav@xxxxxxxxxx> wrote:
> >
> > Hi
> >
> > > From: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx>
> > > Sent: Sunday, October 27, 2024 9:37 AM
> > >
> > > Currently, the mlx5_eq_comp_int() interrupt handler schedules a
> > > tasklet to call mlx5_cq_tasklet_cb() if it processes any completions.
> > > For CQs whose completions don't need to be processed in tasklet
> > > context, this overhead is unnecessary. Atomic operations are needed
> > > to schedule, lock, and clear the tasklet. And when
> > > mlx5_cq_tasklet_cb() runs, it acquires a spin lock to access the
> > > list of CQs enqueued for processing.
> > >
> > > Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid
> > > this overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing
> > > the CQs to be processed in tasklet context, so it can schedule the
> > > tasklet. CQs that need tasklet processing have their interrupt comp
> > > handler set to mlx5_add_cq_to_tasklet(), so they will schedule the
> > > tasklet. CQs that don't need tasklet processing won't schedule the
> > > tasklet. To avoid scheduling the tasklet multiple times during the
> > > same interrupt, only schedule the tasklet in
> > > mlx5_add_cq_to_tasklet() if the tasklet work queue was empty before
> > > the new CQ was pushed to it.
> > >
> > > Note that the mlx4 driver works the same way: it schedules the
> > > tasklet in mlx4_add_cq_to_tasklet() and only if the work queue was
> > > empty before.
> > >
> > > Signed-off-by: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx>
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/cq.c | 5 +++++
> > >  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 5 +----
> > >  2 files changed, 6 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > > index 4caa1b6f40ba..25f3b26db729 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > > @@ -69,22 +69,27 @@ void mlx5_cq_tasklet_cb(struct tasklet_struct *t)
> > >  static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq,
> > >                                     struct mlx5_eqe *eqe)
> > >  {
> > >          unsigned long flags;
> > >          struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
> > > +        bool schedule_tasklet = false;
> > >
> > >          spin_lock_irqsave(&tasklet_ctx->lock, flags);
> > >          /* When migrating CQs between EQs will be implemented, please note
> > >           * that you need to sync this point. It is possible that
> > >           * while migrating a CQ, completions on the old EQs could
> > >           * still arrive.
> > >           */
> > >          if (list_empty_careful(&cq->tasklet_ctx.list)) {
> > >                  mlx5_cq_hold(cq);
> > > +                schedule_tasklet = list_empty(&tasklet_ctx->list);
> > >                  list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
> > >          }
> > >          spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
> > > +
> > > +        if (schedule_tasklet)
> > > +                tasklet_schedule(&tasklet_ctx->task);
> > >  }
> > >
> > >  /* Callers must verify outbox status in case of err */
> > >  int mlx5_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq,
> > >                     u32 *in, int inlen, u32 *out, int outlen)
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > index 68cb86b37e56..66fc17d9c949 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > @@ -112,17 +112,17 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
> > >          struct mlx5_eq_comp *eq_comp =
> > >                  container_of(nb, struct mlx5_eq_comp, irq_nb);
> > >          struct mlx5_eq *eq = &eq_comp->core;
> > >          struct mlx5_eqe *eqe;
> > >          int num_eqes = 0;
> > > -        u32 cqn = -1;
> > >
> > >          eqe = next_eqe_sw(eq);
> > >          if (!eqe)
> > >                  goto out;
> > >
> > >          do {
> > > +                u32 cqn;
> > >                  struct mlx5_core_cq *cq;
> > >
> > A small nit: cqn should be declared after cq to follow the netdev
> > coding guidelines of [1].
>
> Sure, will fix. Thanks for the reference.
>
> > >
> > >                  /* Make sure we read EQ entry contents after we've
> > >                   * checked the ownership bit.
> > >                   */
> > > @@ -145,13 +145,10 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
> > >          } while ((++num_eqes < MLX5_EQ_POLLING_BUDGET) && (eqe = next_eqe_sw(eq)));
> > >
> > >  out:
> > >          eq_update_ci(eq, 1);
> > >
> > > -        if (cqn != -1)
> > > -                tasklet_schedule(&eq_comp->tasklet_ctx.task);
> > > -
> >
> > The current code processes many EQEs and performs the check for
> > tasklet_schedule only once, in the cqn check.
> > With this change, the additional check will be done on every EQE,
> > which will make the interrupt handler marginally slower.
> > Returning a bool from comp() won't be good either, and we cannot
> > inline things here due to the function pointer.
> >
> > The cost of scheduling a null tasklet is higher than this
> > if (schedule_tasklet) check.
> > In another series internally, I am working to reduce the cost of
> > comp() itself, unrelated to this change.
> > So it is OK to have the additional check introduced here.
>
> Right, there's definitely a tradeoff here.
> From what I could tell, there is only one CQ type that processes
> completions in tasklet context (user Infiniband CQs, running
> mlx5_ib_cq_comp()). All others handle their completions in interrupt
> context. Ideally the CQ types that don't need it would not pay the
> cost of the tasklet schedule and execution. There are several atomic
> operations involved in the tasklet path which are fairly expensive.
> In our TCP-heavy workload, we see 4% of the CPU time spent on the
> tasklet_trylock() in tasklet_action_common.constprop.0, with a smaller
> amount spent on the atomic operations in tasklet_schedule(),
> tasklet_clear_sched(), and acquiring the spinlock in
> mlx5_cq_tasklet_cb().

Please include these perf stats in the commit log in v1; they help to
decide on the trade-off and also quantify it.

> I agree the additional branch per EQE should be cheaper than scheduling
> the unused tasklet, but the cost would be paid by Infiniband workloads
> while non-Infiniband workloads see the benefit.
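As an aside, before the alternative proposal below: the pattern the patch
relies on -- schedule the tasklet only when the pending list transitions
from empty to non-empty, so one interrupt schedules at most once no matter
how many completions it enqueues -- can be sketched in isolation as
follows. This is a minimal, illustrative sketch; all names are invented
here, not the mlx5 driver's.

/* Producer (interrupt context): enqueue one entry and schedule the
 * tasklet only on the empty -> non-empty transition of the pending
 * list. Illustrative sketch; names are not the mlx5 driver's.
 */
#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct pending_ctx {
        spinlock_t lock;
        struct list_head list;          /* entries awaiting tasklet processing */
        struct tasklet_struct task;
};

static void enqueue_pending(struct pending_ctx *ctx, struct list_head *entry)
{
        unsigned long flags;
        bool schedule_tasklet;

        spin_lock_irqsave(&ctx->lock, flags);
        /* Only the enqueue that makes the list non-empty must schedule */
        schedule_tasklet = list_empty(&ctx->list);
        list_add_tail(entry, &ctx->list);
        spin_unlock_irqrestore(&ctx->lock, flags);

        if (schedule_tasklet)
                tasklet_schedule(&ctx->task);
}

This mirrors the shape of the patched mlx5_add_cq_to_tasklet() above: the
empty check and the enqueue happen under the lock, while the
tasklet_schedule() call is deferred until after the lock is dropped.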
> How about instead scheduling the tasklet in mlx5_eq_comp_int() if any
> of the CQs have a tasklet completion handler? That should get the best
> of both worlds: skipping the tasklet schedule for CQs that don't need
> it while ensuring the tasklet is only scheduled once per interrupt.
> Something like this:
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index 68cb86b37e56..f0ba3725b8e9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -112,9 +112,9 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
>          struct mlx5_eq_comp *eq_comp =
>                  container_of(nb, struct mlx5_eq_comp, irq_nb);
>          struct mlx5_eq *eq = &eq_comp->core;
> +        bool schedule_tasklet = false;
>          struct mlx5_eqe *eqe;
>          int num_eqes = 0;
> -        u32 cqn = -1;
>
>          eqe = next_eqe_sw(eq);
>          if (!eqe)
> @@ -122,6 +122,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
>
>          do {
>                  struct mlx5_core_cq *cq;
> +                u32 cqn;
>
>                  /* Make sure we read EQ entry contents after we've
>                   * checked the ownership bit.
> @@ -134,6 +135,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
>                  if (likely(cq)) {
>                          ++cq->arm_sn;
>                          cq->comp(cq, eqe);
> +                        schedule_tasklet |= !!cq->tasklet_ctx.comp;
>                          mlx5_cq_put(cq);
>                  } else {
>                          dev_dbg_ratelimited(eq->dev->device,
> @@ -147,7 +149,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
>  out:
>          eq_update_ci(eq, 1);
>
> -        if (cqn != -1)
> +        if (schedule_tasklet)
>                  tasklet_schedule(&eq_comp->tasklet_ctx.task);
>
>          return 0;

The issue with the above change is that it makes the interrupt layer
assume knowledge of the CQ layer (the upper layer above the interrupt)
about whether the tasklet is to be used or not.
I would rather keep the two layers separate, as in your patch.

> Thanks,
> Caleb
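For completeness, here is the consumer side of the same pattern,
continuing the pending_ctx sketch above. This is again illustrative, not
the driver's actual mlx5_cq_tasklet_cb(): the tasklet detaches the whole
pending list under the lock and processes entries with the lock dropped,
so the next interrupt's enqueue sees an empty list and schedules the
tasklet again.

/* Consumer (tasklet context) for the sketch above. Illustrative only;
 * the real mlx5_cq_tasklet_cb() additionally holds/puts CQ references
 * around processing (matching the mlx5_cq_hold() on enqueue).
 */
static void drain_pending(struct tasklet_struct *t)
{
        struct pending_ctx *ctx = from_tasklet(ctx, t, task);
        struct list_head *entry, *tmp;
        unsigned long flags;
        LIST_HEAD(process);

        /* Detach everything queued so far; later enqueues see an empty
         * list and reschedule the tasklet themselves.
         */
        spin_lock_irqsave(&ctx->lock, flags);
        list_splice_init(&ctx->list, &process);
        spin_unlock_irqrestore(&ctx->lock, flags);

        list_for_each_safe(entry, tmp, &process) {
                list_del_init(entry);
                /* handle the completion associated with this entry */
        }
}

At init time the two halves would be wired up with
tasklet_setup(&ctx->task, drain_pending).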