On Tue, Nov 5, 2024 at 9:44 PM Parav Pandit <parav@xxxxxxxxxx> wrote: > > > > From: Caleb Sander <csander@xxxxxxxxxxxxxxx> > > Sent: Tuesday, November 5, 2024 9:36 PM > > > > On Mon, Nov 4, 2024 at 9:22 PM Parav Pandit <parav@xxxxxxxxxx> wrote: > > > > > > > > > > > > > From: Caleb Sander <csander@xxxxxxxxxxxxxxx> > > > > Sent: Monday, November 4, 2024 3:49 AM > > > > > > > > On Sat, Nov 2, 2024 at 8:55 PM Parav Pandit <parav@xxxxxxxxxx> wrote: > > > > > > > > > > > > > > > > > > > > > From: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx> > > > > > > Sent: Friday, November 1, 2024 9:17 AM > > > > > > > > > > > > The logic of eq_update_ci() is duplicated in mlx5_eq_update_ci(). > > > > > > The only additional work done by mlx5_eq_update_ci() is to > > > > > > increment > > > > > > eq->cons_index. Call eq_update_ci() from mlx5_eq_update_ci() to > > > > > > eq->avoid > > > > > > the duplication. > > > > > > > > > > > > Signed-off-by: Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx> > > > > > > --- > > > > > > drivers/net/ethernet/mellanox/mlx5/core/eq.c | 9 +-------- > > > > > > 1 file changed, 1 insertion(+), 8 deletions(-) > > > > > > > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c > > > > > > b/drivers/net/ethernet/mellanox/mlx5/core/eq.c > > > > > > index 859dcf09b770..078029c81935 100644 > > > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c > > > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c > > > > > > @@ -802,19 +802,12 @@ struct mlx5_eqe *mlx5_eq_get_eqe(struct > > > > > > mlx5_eq *eq, u32 cc) } EXPORT_SYMBOL(mlx5_eq_get_eqe); > > > > > > > > > > > > void mlx5_eq_update_ci(struct mlx5_eq *eq, u32 cc, bool arm) { > > > > > > - __be32 __iomem *addr = eq->doorbell + (arm ? 0 : 2); > > > > > > - u32 val; > > > > > > - > > > > > > eq->cons_index += cc; > > > > > > - val = (eq->cons_index & 0xffffff) | (eq->eqn << 24); > > > > > > - > > > > > > - __raw_writel((__force u32)cpu_to_be32(val), addr); > > > > > > - /* We still want ordering, just not swabbing, so add a barrier */ > > > > > > - wmb(); > > > > > > + eq_update_ci(eq, arm); > > > > > Long ago I had similar rework patches to get rid of > > > > > __raw_writel(), which I never got chance to push, > > > > > > > > > > Eq_update_ci() is using full memory barrier. > > > > > While mlx5_eq_update_ci() is using only write memory barrier. > > > > > > > > > > So it is not 100% deduplication by this patch. > > > > > Please have a pre-patch improving eq_update_ci() to use wmb(). > > > > > Followed by this patch. > > > > > > > > Right, patch 1/2 in this series is changing eq_update_ci() to use > > > > writel() instead of __raw_writel() and avoid the memory barrier: > > > > https://lore.kernel.org/lkml/20241101034647.51590-1- > > > > csander@xxxxxxxxxxxxxxx/ > > > This patch has two bugs. > > > 1. writel() writes the MMIO space in LE order. EQ updates are in BE order. > > > So this will break on ppc64 BE. > > > > Okay, so this should be writel(cpu_to_le32(val), addr)? > > > That would break the x86 side because device should receive in BE format regardless of cpu endianness. > Above code will write in the LE format. > > So an API foo_writel() need which does > a. write memory barrier > b. write to MMIO space but without endineness conversion. Got it, thanks. writel(bswap_32(val, addr)) should work, then? I suppose it may introduce a second bswap on BE architectures, but that's probably worth it to avoid the memory barrier. > > > > > > > 2. writel() issues the barrier BEFORE the raw_writel(). > > > As opposed to that eq update needs to have a barrier AFTER the writel(). > > > Likely to synchronize with other CQ related pointers update. > > > > I was referencing this prior discussion about the memory barrier: > > https://lore.kernel.org/netdev/CALzJLG8af0SMfA1C8U8r_Fddb_ZQhvEZd6=2 > > a97dOoBcgLA0xg@xxxxxxxxxxxxxx/ > > From Saeed's message, it sounds like the memory barrier is only used to > > ensure the ordering of writes to the doorbell register, not the ordering of the > > doorbell write relative to any other writes. If some other write needs to be > > ordered after the doorbell write, please explain what it is. > Not write, reading of the CQE likely requires read barrier. But mlx5_eq_update_ci() is already using wmb(), which only imposes an ordering on writes. So if the existing code is correct, the memory barrier cannot be required to order the doorbell write with respect to a read of the CQE. > > > As Gal Pressman > > pointed out, a wmb() at the end of a function doesn't make much sense, as > > there are no further writes in the function to order. If the doorbell write needs > > to be ordered before some other write in a caller function, the memory barrier > > should probably move to the caller. > It is the two EQ doorbell writes that needs to be ordered with respect to each other. > So please audit the code for CQE processing ensure that there is read barrier after valid bit. > And removal of this read barrier does not affect there. > > It would be best if you can test on ARM (non x86_64) platform for this change. Unfortunately I don't have access to any platform besides x86_64 with ConnectX cards. > > > > > > > > > > Are you suggesting something different? If so, it would be great if > > > > you could clarify what you mean. > > > > > > > So I was suggesting to keep __raw_writel() as is and replace mb() with > > wmb(). > > > > wmb() would certainly be cheaper than mb(), but I would like to understand > > the requirement for the barrier in the first place. The fence instruction is very > > expensive. > > > To order two doorbell writes of the same EQ. Right, I understand why the memory barrier is needed with the existing __raw_writel(), as it provides no ordering guarantees. But writel() seems to guarantee the necessary ordering of writes to the EQ's doorbell. This is the relevant documentation from memory-barriers.txt (I assume the mutual exclusion of interrupt handlers is equivalent to holding a spinlock): 2. A writeX() issued by a CPU thread holding a spinlock is ordered before a writeX() to the same peripheral from another CPU thread issued after a later acquisition of the same spinlock. This ensures that MMIO register writes to a particular device issued while holding a spinlock will arrive in an order consistent with acquisitions of the lock. Do you still think the barrier is necessary if writel() is used instead? If you feel strongly about keeping the wmb(), I can do that. It's certainly an improvement over the full memory barrier. But fences are quite expensive, so I would prefer to remove it if it's not necessary for correctness. Thanks, Caleb