Re: [PATCH mlx5-next 1/3] net/mlx5: Nullify eq->dbg and qp->dbg pointers post destruction

Leon Romanovsky <leon@xxxxxxxxxx> · Sun, 10 Apr 2022 10:58:16 +0300

On Fri, Apr 08, 2022 at 12:30:35PM -0700, Saeed Mahameed wrote:
> On 06 Apr 10:55, Leon Romanovsky wrote:
> > On Tue, Apr 05, 2022 at 12:48:45PM -0700, Saeed Mahameed wrote:
> > > On 05 Apr 11:12, Leon Romanovsky wrote:
> > > > From: Patrisious Haddad <phaddad@xxxxxxxxxx>
> > > >
> > > > Prior to this patch in the case that destroy_unmap_eq()
> > > > failed and was called again, it triggered an additional call of
> > > 
> > > Where is it being failed and called again ? this shouldn't even be an
> > > option, we try to keep mlx5 symmetrical, constructors and destructors are
> > > supposed to be called only once in their respective positions.
> > > the callers must be fixed to avoid re-entry, or change destructors to clear
> > > up all resources even on failures, no matter what do not invent a reentry
> > > protocols to mlx5 destructors.
> > 
> > It can happen when QP is exposed through DEVX interface. In that flow,
> > only FW knows about it and reference count all users. This means that
> > attempt to destroy such QP will fail, but mlx5_core is structured in
> > such way that all cleanup was done before calling to FW to get
> > success/fail response.
> 
> I wasn't talking about destroy_qp, actually destroy_qp is implemented the
> way i am asking you to implement destroy_eq(); remove debugfs on first call
> to destroy EQ, and drop the reentry logic from from mlx5_eq_destroy_generic
> and destroy_async_eq.
> 
> EQ is a core/mlx5_ib resources, it's not exposed to user nor DEVX, it
> shouldn't be subject to DEVX limitations.

I tend to agree with you. I'll take another look on it and resubmit.

> 
> Also looking at the destroy_qp implementation, it removes the debugfs
> unconditionally even if the QP has ref count and removal will fail in FW.
> just FYI.

Right, we don't care about debugfs.

> 
> For EQ I don't even understand why devx can cause ODP EQ removal to fail..
> you must fix this at mlx5_ib layer, but for this patch, please drop the
> re-entry and remove debugfs in destroy_eq, unconditionally.

The reason to complexity is not debugfs, but an existence of
"mlx5_frag_buf_free(dev, &eq->frag_buf);" line, after FW command is
executed.

We need to separate to two flows: the one that can tolerate FW cmd failures
and the one that can't. If you don't add "reentry" flag, you can (theoretically)
find yourself leaking ->frag_buf in the flows that don't know how to reentry.

I'll resubmit.

Thanks