On Tue, Jan 22, 2019 at 05:18:42PM +0000, Jason Gunthorpe wrote: > On Tue, Jan 22, 2019 at 09:16:11AM +0200, Leon Romanovsky wrote: > > From: Moni Shoua <monis@xxxxxxxxxxxx> > > > > The variable 'dying' in struct ib_umem_odp is used 2 flows, not in an atomic > > manner by the following scheme > > > > 1. read > > 2. if zero proceed > > > > This exposes a race between the flows that can end up with 2 increments > > of num_leaf_free but only one decrement that prevent the implicit mr > > structure from being destroyed. > > > > This call trace is a result of such a race > > > > INFO: task leakcatch_ibm.o:42313 blocked for more than 120 seconds. > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > leakcatch_ibm.o D 00001000004382e4 0 42313 42297 0x00040080 > > Call Trace: > > [c000001e3b107870] [c000001e3b1078d0] 0xc000001e3b1078d0 (unreliable) > > [c000001e3b107a40] [c0000000000168dc] switch_to+0x25c/0x470 > > [c000001e3b107aa0] [c000000000996854] __schedule+0x414/0xad0 > > [c000001e3b107b80] [d000000023f278c4] mlx5_ib_free_implicit_mr+0xb4/0x110 [mlx5_ib] > > [c000001e3b107bf0] [d000000023f1c5c8] mlx5_ib_invalidate_mr+0x178/0x1e0 [mlx5_ib] > > [c000001e3b107c30] [d000000023f1c7d4] mlx5_ib_dereg_mr+0xe4/0x110 [mlx5_ib] > > [c000001e3b107c70] [d000000023b414c4] ib_dereg_mr+0x34/0x70 [ib_core] > > [c000001e3b107ca0] [d000000027de1fbc] ib_uverbs_cleanup_ucontext+0x4bc/0x6c0 [ib_uverbs] > > [c000001e3b107d20] [d000000027de2204] ib_uverbs_close+0x44/0x140 [ib_uverbs] > > [c000001e3b107d50] [c0000000003219e4] __fput+0xd4/0x2d0 > > [c000001e3b107db0] [c00000000010f6d4] task_work_run+0xe4/0x160 > > [c000001e3b107e00] [c000000000018c6c] do_notify_resume+0xcc/0x100 > > [c000001e3b107e30] [c00000000000a7b0] ret_from_except_lite+0x5c/0x60 > > > > Fix by making steps 1 and 2 above done in an atomic way. > > > > Fixes: 882214e2b128 ("IB/core: Implement support for MMU notifiers regarding on demand paging regions") > > Signed-off-by: Moni Shoua <monis@xxxxxxxxxxxx> > > Reviewed-by: Artemy Kovalyov <artemyko@xxxxxxxxxxxx> > > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx> > > drivers/infiniband/core/umem_odp.c | 2 +- > > drivers/infiniband/hw/mlx5/odp.c | 10 ++++------ > > include/rdma/ib_umem_odp.h | 2 +- > > 3 files changed, 6 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c > > index 0d154c5d12cc..679146352a3c 100644 > > +++ b/drivers/infiniband/core/umem_odp.c > > @@ -113,7 +113,7 @@ static int ib_umem_notifier_release_trampoline(struct ib_umem_odp *umem_odp, > > * prevent any further fault handling on this MR. > > */ > > ib_umem_notifier_start_account(umem_odp); > > - umem_odp->dying = 1; > > + atomic_set(&umem_odp->dying, 1); > > /* Make sure that the fact the umem is dying is out before we release > > * all pending page faults. */ > > smp_wmb(); > > Is the smp_wmb still needed if there is an atomic, or it should it be > one of the special atomic mbs? Or you don't need need a barrier at > all if the test/set bit API is used instead. From what I read, atomic doesn't prevent from reordering and only provides reliable oneshot read/write access. It looks like *_wmb() call is still needed. Regarding "special atomic mbs", which one did you have in mind? > > > diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c > > index fa22b28e8e43..c1d258b86145 100644 > > +++ b/drivers/infiniband/hw/mlx5/odp.c > > @@ -98,7 +98,7 @@ static int check_parent(struct ib_umem_odp *odp, > > { > > struct mlx5_ib_mr *mr = odp->private; > > > > - return mr && mr->parent == parent && !odp->dying; > > + return mr && mr->parent == parent && !atomic_read(&odp->dying); > > } > > Almost every time I see code trying to create a 'is_closed/dying' > construct without locks it turns out to be racy and wrong. No argue, totally agree. > > Why is it OK for dying to become 1 the instant after check_parent is > called? If it is OK then why are we testing dying here? In theory, we are supposed to be protected by "per_mm->umem_rwsem" rwlock and it won't change without taking that lock. > > > @@ -551,10 +550,9 @@ static int mr_leaf_free(struct ib_umem_odp *umem_odp, u64 start, u64 end, > > ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem), > > ib_umem_end(umem)); > > > > - if (umem_odp->dying) > > + if (!atomic_add_unless(&umem_odp->dying, 1, 1)) > > return 0; > > Why use an atomic? Isn't this algorithm just test_and_set_bit() / > set_bit() and test_bit() ? They are atomics too. set_bit/test_bit operates on long, while Moni uses atomic_t variable, hence this change. > > > @@ -682,7 +680,7 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr, > > > > out: > > if (ret == -EAGAIN) { > > - if (implicit || !odp->dying) { > > + if (implicit || !atomic_read(&odp->dying)) { > > unsigned long timeout = > > msecs_to_jiffies(MMU_NOTIFIER_TIMEOUT); > > Same question as check_parent Actually, at the end of my answer, I had the feeling that the root cause for observed trace is missed "per_mm->umem_rwsem" lock. Moni??? Thanks > > Jason
Attachment:
signature.asc
Description: PGP signature