2015-07-09 13:21 GMT+02:00 Or Gerlitz <ogerlitz@xxxxxxxxxxxx>:
> On 7/9/2015 2:14 PM, Jack Wang wrote:
>>
>> I managed to update the kernel to OFED 3.0 to verify the bug, but I
>> can still reproduce it. Maybe some synchronize_irq calls are still
>> missing?
>
> Again, even if you don't use the upstream kernel for production, I suggest
> you try to reproduce the bug there, and if it exists we'll try to solve it
> upstream and later port the fix to MLNX OFED. Makes sense? You can start
> with just the installed 3.18.14.
>
> Or.

Hello Or,

We have other kernel modules, together with the autotest infrastructure,
so it's not that easy to install a 3.18.14 kernel.

I looked into the code a little. I think the bug may be related to the
radix_tree usage in mlx4_cq_free: the OFED code calls radix_tree_delete
before synchronize_irq, but the mainline code calls radix_tree_delete
after synchronize_irq. Does this matter?

I'm building a new kernel with this small change:

--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -393,16 +393,16 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
 	if (err)
 		mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn);
 
-	spin_lock(&cq_table->lock);
-	radix_tree_delete(&cq_table->tree, cq->cqn);
-	spin_unlock(&cq_table->lock);
-
 	synchronize_irq(priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq);
 	/* synchronize ASYNC irq */
 	if (priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(cq->vector)].irq !=
 	    priv->eq_table.eq[MLX4_EQ_ASYNC].irq)
 		synchronize_irq(priv->eq_table.eq[MLX4_EQ_ASYNC].irq);
 
+	spin_lock(&cq_table->lock);
+	radix_tree_delete(&cq_table->tree, cq->cqn);
+	spin_unlock(&cq_table->lock);
+
 	if (atomic_dec_and_test(&cq->refcount))
 		complete(&cq->free);
 	wait_for_completion(&cq->free);

Thanks,
Jack