> From: Caleb Sander <csander@xxxxxxxxxxxxxxx>
> Sent: Tuesday, October 29, 2024 10:02 PM
>
> On Mon, Oct 28, 2024 at 9:08 PM Parav Pandit <parav@xxxxxxxxxx> wrote:
> >
> > Hi
> >
> > > From: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx>
> > > Sent: Sunday, October 27, 2024 9:37 AM
> > >
> > > Currently, the mlx5_eq_comp_int() interrupt handler schedules a
> > > tasklet to call mlx5_cq_tasklet_cb() if it processes any completions.
> > > For CQs whose completions don't need to be processed in tasklet
> > > context, this overhead is unnecessary. Atomic operations are needed
> > > to schedule, lock, and clear the tasklet. And when
> > > mlx5_cq_tasklet_cb() runs, it acquires a spin lock to access the
> > > list of CQs enqueued for processing.
> > >
> > > Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid
> > > this overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing
> > > the CQs to be processed in tasklet context, so it can schedule the
> > > tasklet. CQs that need tasklet processing have their interrupt comp
> > > handler set to mlx5_add_cq_to_tasklet(), so they will schedule the
> > > tasklet. CQs that don't need tasklet processing won't schedule the
> > > tasklet. To avoid scheduling the tasklet multiple times during the
> > > same interrupt, only schedule the tasklet in
> > > mlx5_add_cq_to_tasklet() if the tasklet work queue was empty before
> > > the new CQ was pushed to it.
> > >
> > > Note that the mlx4 driver works the same way: it schedules the
> > > tasklet in mlx4_add_cq_to_tasklet() and only if the work queue was
> > > empty before.
> > >
> > > Signed-off-by: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx>
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/cq.c | 5 +++++
> > >  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 5 +----
> > >  2 files changed, 6 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > > index 4caa1b6f40ba..25f3b26db729 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > > @@ -69,22 +69,27 @@ void mlx5_cq_tasklet_cb(struct tasklet_struct *t)
> > >  static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq,
> > >                                     struct mlx5_eqe *eqe)
> > >  {
> > >          unsigned long flags;
> > >          struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
> > > +        bool schedule_tasklet = false;
> > >
> > >          spin_lock_irqsave(&tasklet_ctx->lock, flags);
> > >          /* When migrating CQs between EQs will be implemented, please note
> > >           * that you need to sync this point. It is possible that
> > >           * while migrating a CQ, completions on the old EQs could
> > >           * still arrive.
> > >           */
> > >          if (list_empty_careful(&cq->tasklet_ctx.list)) {
> > >                  mlx5_cq_hold(cq);
> > > +                schedule_tasklet = list_empty(&tasklet_ctx->list);
> > >                  list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
> > >          }
> > >          spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
> > > +
> > > +        if (schedule_tasklet)
> > > +                tasklet_schedule(&tasklet_ctx->task);
> > >  }
> > >
> > >  /* Callers must verify outbox status in case of err */
> > >  int mlx5_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq,
> > >                     u32 *in, int inlen, u32 *out, int outlen)
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > index 68cb86b37e56..66fc17d9c949 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > @@ -112,17 +112,17 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
> > >          struct mlx5_eq_comp *eq_comp =
> > >                  container_of(nb, struct mlx5_eq_comp, irq_nb);
> > >          struct mlx5_eq *eq = &eq_comp->core;
> > >          struct mlx5_eqe *eqe;
> > >          int num_eqes = 0;
> > > -        u32 cqn = -1;
> > >
> > >          eqe = next_eqe_sw(eq);
> > >          if (!eqe)
> > >                  goto out;
> > >
> > >          do {
> > > +                u32 cqn;
> > >                  struct mlx5_core_cq *cq;
> > >
> > A small nit: cqn should be declared after cq to follow the netdev
> > coding guidelines of [1].
>
> Sure, will fix. Thanks for the reference.
>
> > >
> > >                  /* Make sure we read EQ entry contents after we've
> > >                   * checked the ownership bit.
> > >                   */
> > > @@ -145,13 +145,10 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
> > >          } while ((++num_eqes < MLX5_EQ_POLLING_BUDGET) && (eqe = next_eqe_sw(eq)));
> > >
> > >  out:
> > >          eq_update_ci(eq, 1);
> > >
> > > -        if (cqn != -1)
> > > -                tasklet_schedule(&eq_comp->tasklet_ctx.task);
> > > -
> >
> > The current code processes many EQEs and performs the check for
> > tasklet_schedule only once, in the cqn check.
> > With this change, the additional check will be done on every EQE,
> > which will make the interrupt handler marginally slower.
> > Returning a bool from comp() won't be good either, and we cannot
> > inline things here due to the function pointer.
> >
> > The cost of scheduling a null tasklet is higher than this
> > if (schedule_tasklet) check.
> > In another series internally, I am working to reduce the cost of
> > comp() itself, unrelated to this change.
> > So it is OK to have the additional check introduced here.
>
> Right, there's definitely a tradeoff here.
> From what I could tell, there is only one CQ type that processes
> completions in tasklet context (user Infiniband CQs, running
> mlx5_ib_cq_comp()). All others handle their completions in interrupt
> context. Ideally the CQ types that don't need it would not pay the
> cost of the tasklet schedule and execution. There are several atomic
> operations involved in the tasklet path which are fairly expensive.
> In our TCP-heavy workload, we see 4% of the CPU time spent on the
> tasklet_trylock() in tasklet_action_common.constprop.0, with a smaller
> amount spent on the atomic operations in tasklet_schedule(),
> tasklet_clear_sched(), and acquiring the spinlock in
> mlx5_cq_tasklet_cb().

Please include these perf stats in the commit log in v1; they help to
decide on the trade-off and also quantify it.

> I agree the additional branch per EQE should be cheaper than scheduling
> the unused tasklet, but the cost would be paid by Infiniband workloads
> while non-Infiniband workloads see the benefit.
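As an aside, before the alternative proposal below: the pattern the patch
relies on -- schedule the tasklet only when the pending list transitions
from empty to non-empty, so one interrupt schedules at most once no matter
how many completions it enqueues -- can be sketched in isolation as
follows. This is a minimal, illustrative sketch; all names are invented
here, not the mlx5 driver's.

/* Producer (interrupt context): enqueue one entry and schedule the
 * tasklet only on the empty -> non-empty transition of the pending
 * list. Illustrative sketch; names are not the mlx5 driver's.
 */
#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct pending_ctx {
        spinlock_t lock;
        struct list_head list;          /* entries awaiting tasklet processing */
        struct tasklet_struct task;
};

static void enqueue_pending(struct pending_ctx *ctx, struct list_head *entry)
{
        unsigned long flags;
        bool schedule_tasklet;

        spin_lock_irqsave(&ctx->lock, flags);
        /* Only the enqueue that makes the list non-empty must schedule */
        schedule_tasklet = list_empty(&ctx->list);
        list_add_tail(entry, &ctx->list);
        spin_unlock_irqrestore(&ctx->lock, flags);

        if (schedule_tasklet)
                tasklet_schedule(&ctx->task);
}

This mirrors the shape of the patched mlx5_add_cq_to_tasklet() above: the
empty check and the enqueue happen under the lock, while the
tasklet_schedule() call is deferred until after the lock is dropped.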
> How about instead scheduling the tasklet in mlx5_eq_comp_int() if any
> of the CQs have a tasklet completion handler? That should get the best
> of both worlds: skipping the tasklet schedule for CQs that don't need
> it while ensuring the tasklet is only scheduled once per interrupt.
> Something like this:
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index 68cb86b37e56..f0ba3725b8e9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -112,9 +112,9 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
>          struct mlx5_eq_comp *eq_comp =
>                  container_of(nb, struct mlx5_eq_comp, irq_nb);
>          struct mlx5_eq *eq = &eq_comp->core;
> +        bool schedule_tasklet = false;
>          struct mlx5_eqe *eqe;
>          int num_eqes = 0;
> -        u32 cqn = -1;
>
>          eqe = next_eqe_sw(eq);
>          if (!eqe)
> @@ -122,6 +122,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
>
>          do {
>                  struct mlx5_core_cq *cq;
> +                u32 cqn;
>
>                  /* Make sure we read EQ entry contents after we've
>                   * checked the ownership bit.
> @@ -134,6 +135,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
>                  if (likely(cq)) {
>                          ++cq->arm_sn;
>                          cq->comp(cq, eqe);
> +                        schedule_tasklet |= !!cq->tasklet_ctx.comp;
>                          mlx5_cq_put(cq);
>                  } else {
>                          dev_dbg_ratelimited(eq->dev->device,
> @@ -147,7 +149,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
>  out:
>          eq_update_ci(eq, 1);
>
> -        if (cqn != -1)
> +        if (schedule_tasklet)
>                  tasklet_schedule(&eq_comp->tasklet_ctx.task);
>
>          return 0;

The issue with the above change is that it makes the interrupt layer
assume knowledge of the CQ layer (the upper layer above the interrupt)
about whether the tasklet is to be used or not.
I would rather keep the two layers separate, as in your patch.

> Thanks,
> Caleb
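For completeness, here is the consumer side of the same pattern,
continuing the pending_ctx sketch above. This is again illustrative, not
the driver's actual mlx5_cq_tasklet_cb(): the tasklet detaches the whole
pending list under the lock and processes entries with the lock dropped,
so the next interrupt's enqueue sees an empty list and schedules the
tasklet again.

/* Consumer (tasklet context) for the sketch above. Illustrative only;
 * the real mlx5_cq_tasklet_cb() additionally holds/puts CQ references
 * around processing (matching the mlx5_cq_hold() on enqueue).
 */
static void drain_pending(struct tasklet_struct *t)
{
        struct pending_ctx *ctx = from_tasklet(ctx, t, task);
        struct list_head *entry, *tmp;
        unsigned long flags;
        LIST_HEAD(process);

        /* Detach everything queued so far; later enqueues see an empty
         * list and reschedule the tasklet themselves.
         */
        spin_lock_irqsave(&ctx->lock, flags);
        list_splice_init(&ctx->list, &process);
        spin_unlock_irqrestore(&ctx->lock, flags);

        list_for_each_safe(entry, tmp, &process) {
                list_del_init(entry);
                /* handle the completion associated with this entry */
        }
}

At init time the two halves would be wired up with
tasklet_setup(&ctx->task, drain_pending).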