Re: [PATCH mlx5-next 1/3] net/mlx5: Nullify eq->dbg and qp->dbg pointers post destruction

Saeed Mahameed <saeedm@xxxxxxxxxx> · Fri, 8 Apr 2022 12:30:35 -0700

On 06 Apr 10:55, Leon Romanovsky wrote:
On Tue, Apr 05, 2022 at 12:48:45PM -0700, Saeed Mahameed wrote:
On 05 Apr 11:12, Leon Romanovsky wrote:
> From: Patrisious Haddad <phaddad@xxxxxxxxxx>
>
> Prior to this patch in the case that destroy_unmap_eq()
> failed and was called again, it triggered an additional call of

Where is it being failed and called again ? this shouldn't even be an
option, we try to keep mlx5 symmetrical, constructors and destructors are
supposed to be called only once in their respective positions.
the callers must be fixed to avoid re-entry, or change destructors to clear
up all resources even on failures, no matter what do not invent a reentry
protocols to mlx5 destructors.

It can happen when QP is exposed through DEVX interface. In that flow,
only FW knows about it and reference count all users. This means that
attempt to destroy such QP will fail, but mlx5_core is structured in
such way that all cleanup was done before calling to FW to get
success/fail response.

I wasn't talking about destroy_qp, actually destroy_qp is implemented the
way i am asking you to implement destroy_eq(); remove debugfs on first call
to destroy EQ, and drop the reentry logic from from mlx5_eq_destroy_generic
and destroy_async_eq.

EQ is a core/mlx5_ib resources, it's not exposed to user nor DEVX, it
shouldn't be subject to DEVX limitations.

Also looking at the destroy_qp implementation, it removes the debugfs
unconditionally even if the QP has ref count and removal will fail in FW.
just FYI.

For EQ I don't even understand why devx can cause ODP EQ removal to fail..
you must fix this at mlx5_ib layer, but for this patch, please drop the
re-entry and remove debugfs in destroy_eq, unconditionally.

For more detailed information, see this cover letter:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2F20200907120921.476363-1-leon%40kernel.org%2F&amp;data=04%7C01%7Csaeedm%40nvidia.com%7Cee8a0add0a154e055f8508da17a2d6fd%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637848285407413801%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=xD54MMVFSeONDeQyPHOinh93CPWjp2rUEL7F3izc210%3D&amp;reserved=0

<...>

> int mlx5_eq_destroy_generic(struct mlx5_core_dev *dev, struct mlx5_eq *eq)
> {
> +	struct mlx5_eq_table *eq_table = dev->priv.eq_table;
> 	int err;
>
> 	if (IS_ERR(eq))
> 		return -EINVAL;
>
> -	err = destroy_async_eq(dev, eq);
> +	mutex_lock(&eq_table->lock);

Here you are inventing the re-entry. Please drop this and fix properly. And
avoid boolean parameters to mlx5 core
functions as much as possible, let's keep mlx5_core simple.

If after reading the link above, you were not convinced, let's take it offline.

I am not convinced, see above.

Thanks