Re: [PATCH rdma-next 4/4] IB/mlx5: Test and increment 'dying' atomically

Leon Romanovsky <leon@xxxxxxxxxx> · Thu, 24 Jan 2019 09:47:43 +0200

On Tue, Jan 22, 2019 at 05:18:42PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 22, 2019 at 09:16:11AM +0200, Leon Romanovsky wrote:
> > From: Moni Shoua <monis@xxxxxxxxxxxx>
> >
> > The variable 'dying' in struct ib_umem_odp is used 2 flows, not in an atomic
> > manner by the following scheme
> >
> > 1. read
> > 2. if zero proceed
> >
> > This exposes a race between the flows that can end up with 2 increments
> > of num_leaf_free but only one decrement that prevent the implicit mr
> > structure from being destroyed.
> >
> > This call trace is a result of such a race
> >
> >  INFO: task leakcatch_ibm.o:42313 blocked for more than 120 seconds.
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  leakcatch_ibm.o D 00001000004382e4 0 42313 42297 0x00040080
> >  Call Trace:
> >  [c000001e3b107870] [c000001e3b1078d0] 0xc000001e3b1078d0 (unreliable)
> >  [c000001e3b107a40] [c0000000000168dc] switch_to+0x25c/0x470
> >  [c000001e3b107aa0] [c000000000996854] __schedule+0x414/0xad0
> >  [c000001e3b107b80] [d000000023f278c4] mlx5_ib_free_implicit_mr+0xb4/0x110 [mlx5_ib]
> >  [c000001e3b107bf0] [d000000023f1c5c8] mlx5_ib_invalidate_mr+0x178/0x1e0 [mlx5_ib]
> >  [c000001e3b107c30] [d000000023f1c7d4] mlx5_ib_dereg_mr+0xe4/0x110 [mlx5_ib]
> >  [c000001e3b107c70] [d000000023b414c4] ib_dereg_mr+0x34/0x70 [ib_core]
> >  [c000001e3b107ca0] [d000000027de1fbc] ib_uverbs_cleanup_ucontext+0x4bc/0x6c0 [ib_uverbs]
> >  [c000001e3b107d20] [d000000027de2204] ib_uverbs_close+0x44/0x140 [ib_uverbs]
> >  [c000001e3b107d50] [c0000000003219e4] __fput+0xd4/0x2d0
> >  [c000001e3b107db0] [c00000000010f6d4] task_work_run+0xe4/0x160
> >  [c000001e3b107e00] [c000000000018c6c] do_notify_resume+0xcc/0x100
> >  [c000001e3b107e30] [c00000000000a7b0] ret_from_except_lite+0x5c/0x60
> >
> > Fix by making steps 1 and 2 above done in an atomic way.
> >
> > Fixes: 882214e2b128 ("IB/core: Implement support for MMU notifiers regarding on demand paging regions")
> > Signed-off-by: Moni Shoua <monis@xxxxxxxxxxxx>
> > Reviewed-by: Artemy Kovalyov <artemyko@xxxxxxxxxxxx>
> > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> >  drivers/infiniband/core/umem_odp.c |  2 +-
> >  drivers/infiniband/hw/mlx5/odp.c   | 10 ++++------
> >  include/rdma/ib_umem_odp.h         |  2 +-
> >  3 files changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> > index 0d154c5d12cc..679146352a3c 100644
> > +++ b/drivers/infiniband/core/umem_odp.c
> > @@ -113,7 +113,7 @@ static int ib_umem_notifier_release_trampoline(struct ib_umem_odp *umem_odp,
> >  	 * prevent any further fault handling on this MR.
> >  	 */
> >  	ib_umem_notifier_start_account(umem_odp);
> > -	umem_odp->dying = 1;
> > +	atomic_set(&umem_odp->dying, 1);
> >  	/* Make sure that the fact the umem is dying is out before we release
> >  	 * all pending page faults. */
> >  	smp_wmb();
>
> Is the smp_wmb still needed if there is an atomic, or it should it be
> one of the special atomic mbs? Or you don't need  need a barrier at
> all if the test/set bit API is used instead.

From what I read, atomic doesn't prevent from reordering and only
provides reliable oneshot read/write access. It looks like *_wmb()
call is still needed. Regarding "special atomic mbs", which one did you
have in mind?

>
> > diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> > index fa22b28e8e43..c1d258b86145 100644
> > +++ b/drivers/infiniband/hw/mlx5/odp.c
> > @@ -98,7 +98,7 @@ static int check_parent(struct ib_umem_odp *odp,
> >  {
> >  	struct mlx5_ib_mr *mr = odp->private;
> >
> > -	return mr && mr->parent == parent && !odp->dying;
> > +	return mr && mr->parent == parent && !atomic_read(&odp->dying);
> >  }
>
> Almost every time I see code trying to create a 'is_closed/dying'
> construct without locks it turns out to be racy and wrong.

No argue, totally agree.

>
> Why is it OK for dying to become 1 the instant after check_parent is
> called? If it is OK then why are we testing dying here?

In theory, we are supposed to be protected by "per_mm->umem_rwsem" rwlock
and it won't change without taking that lock.

>
> > @@ -551,10 +550,9 @@ static int mr_leaf_free(struct ib_umem_odp *umem_odp, u64 start, u64 end,
> >  	ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem),
> >  				    ib_umem_end(umem));
> >
> > -	if (umem_odp->dying)
> > +	if (!atomic_add_unless(&umem_odp->dying, 1, 1))
> >  		return 0;
>
> Why use an atomic? Isn't this algorithm just test_and_set_bit() /
> set_bit() and test_bit() ? They are atomics too.

set_bit/test_bit operates on long, while Moni uses atomic_t variable,
hence this change.

>
> > @@ -682,7 +680,7 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
> >
> >  out:
> >  	if (ret == -EAGAIN) {
> > -		if (implicit || !odp->dying) {
> > +		if (implicit || !atomic_read(&odp->dying)) {
> >  			unsigned long timeout =
> >  				msecs_to_jiffies(MMU_NOTIFIER_TIMEOUT);
>
> Same question as check_parent

Actually, at the end of my answer, I had the feeling that the root cause for observed trace
is missed "per_mm->umem_rwsem" lock.

Moni???

Thanks

>
> Jason
Attachment:
signature.asc

Description: PGP signature