The memory barrier in eq_update_ci() after the doorbell write is a significant hot spot in mlx5_eq_comp_int(). Under heavy TCP load, we see 3% of CPU time spent on the mfence. As explained in [1], this barrier is only needed to preserve the ordering of writes to the doorbell register. Use writel() instead of __raw_writel() for the doorbell write to provide this ordering without the need for a full memory barrier. memory-barriers.txt guarantees MMIO writes using writel() appear to the device in the same order they were made. On strongly-ordered architectures like x86, writel() adds no overhead to __raw_writel(); both translate into a single store instruction. Removing the mb() avoids the costly mfence instruction. [1]: https://lore.kernel.org/netdev/CALzJLG8af0SMfA1C8U8r_Fddb_ZQhvEZd6=2a97dOoBcgLA0xg@xxxxxxxxxxxxxx/ Signed-off-by: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx> --- drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h index 4b7f7131c560..f03736711343 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h @@ -68,13 +68,12 @@ static inline struct mlx5_eqe *next_eqe_sw(struct mlx5_eq *eq) static inline void eq_update_ci(struct mlx5_eq *eq, int arm) { __be32 __iomem *addr = eq->doorbell + (arm ? 0 : 2); u32 val = (eq->cons_index & 0xffffff) | (eq->eqn << 24); - __raw_writel((__force u32)cpu_to_be32(val), addr); - /* We still want ordering, just not swabbing, so add a barrier */ - mb(); + /* Ensure ordering of consecutive doorbell writes */ + writel((__force u32)cpu_to_be32(val), addr); } int mlx5_eq_table_init(struct mlx5_core_dev *dev); void mlx5_eq_table_cleanup(struct mlx5_core_dev *dev); int mlx5_eq_table_create(struct mlx5_core_dev *dev); -- 2.45.2