On Thu, Jun 8, 2017 at 3:29 PM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>>>>> My interpretation is that mlx5 tried to do this for the (rather
>>>>> esoteric in my mind) case where the platform does not have enough
>>>>> vectors for the driver to allocate percpu. In this case, the next
>>>>> best thing is to stay as close to the device affinity as possible.
>>>>
>>>> No, we did it because the mlx5e netdevice assumes that
>>>> IRQ[0]..IRQ[#num_numa/#cpu_per_numa] are always bound to the NUMA
>>>> node close to the device, and the mlx5e driver chooses those IRQs to
>>>> spread the RSS hash only onto them, never using the other IRQs/cores.
>>>
>>> OK, that explains a lot of weirdness I've seen with mlx5e.
>>>
>>> Can you explain why you're using only a single numa node for your RSS
>>> table? What does it buy you? You open RX rings for _all_ cpus but
>>> only spread on part of them? I must be missing something here...
>>
>> Adding Tariq,
>>
>> this is also part of the weirdness :), we do that to make sure that for
>> any OOB test you run you always get the best performance, and to
>> guarantee that we always use the close NUMA cores.
>
> Well, I wish I knew that before :( I got to a point where I started
> to seriously doubt the mathematical strength of xor/toeplitz hashing :)
>
> I'm sure you ran plenty of performance tests, but from my experience,
> application locality makes much more difference than device locality,
> especially when the application needs to touch the data...
>
>> We open RX rings on all of the cores in case the user wants to change
>> the RSS table on the fly, with "ethtool -X", to point to all of them.
>
> That is very counter-intuitive afaict, is it documented anywhere?
>
> Users might rely on the (absolutely reasonable) assumption that if a
> NIC exposes X rx rings, rx hashing should spread across all of them and
> not a subset.

This is why we want to remove this assumption from mlx5e.

>> But we are willing to change that, Tariq can provide the patch;
>> without that change, mlx5e is broken.
>
> What patch? To modify the RSS spread? What exactly is broken?

The current mlx5 netdev, with your patches applied, might spread traffic
_ONLY_ to the far NUMA node. To fix this in mlx5e we need something like
this:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 41cd22a223dc..15499865784f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3733,18 +3733,8 @@ void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
                                    u32 *indirection_rqt, int len,
                                    int num_channels)
 {
-       int node = mdev->priv.numa_node;
-       int node_num_of_cores;
        int i;

-       if (node == -1)
-               node = first_online_node;
-
-       node_num_of_cores = cpumask_weight(cpumask_of_node(node));
-
-       if (node_num_of_cores)
-               num_channels = min_t(int, num_channels, node_num_of_cores);
-

We are working on such a patch, and we would like to have it submitted
and accepted along with or before your series (a rough sketch of the
resulting function is at the end of this mail). Also, we will have to
run a performance & functional regression cycle on all the patches
combined.

> So I'm not sure how to move forward here, should we modify the
> indirection table construction to not rely on the unique affinity
> mappings?
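To make the effect of the hunk above concrete, here is a minimal sketch
of what mlx5e_build_default_indir_rqt would reduce to once the NUMA
clamp is removed. The remaining loop body is assumed from context (the
hunk only shows the deleted lines), not copied from the tree: every
indirection-table entry is simply round-robined over all open channels,
so RSS spreads across every RX ring.

void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
                                   u32 *indirection_rqt, int len,
                                   int num_channels)
{
        int i;

        /* No NUMA-node clamp any more: entry i points at channel
         * i % num_channels, so all open RX rings receive RSS traffic.
         */
        for (i = 0; i < len; i++)
                indirection_rqt[i] = i % num_channels;
}

A user who still prefers the old near-NUMA-only spread could then
restrict it at runtime, e.g. with "ethtool -X <dev> equal <n>", where
<n> is the number of channels on the NUMA node close to the device.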