Re: [PATCH rdma-next 4/4] IB/mlx5: Test and increment 'dying' atomically

On Thu, Jan 24, 2019 at 9:48 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Tue, Jan 22, 2019 at 05:18:42PM +0000, Jason Gunthorpe wrote:
> > On Tue, Jan 22, 2019 at 09:16:11AM +0200, Leon Romanovsky wrote:
> > > From: Moni Shoua <monis@xxxxxxxxxxxx>
> > >
> > > The variable 'dying' in struct ib_umem_odp is used in 2 flows, not in an
> > > atomic manner, by the following scheme
> > >
> > > 1. read
> > > 2. if zero proceed
> > >
> > > This exposes a race between the flows that can end up with 2 increments
> > > of num_leaf_free but only one decrement, which prevents the implicit mr
> > > structure from being destroyed.
> > >
> > > This call trace is a result of such a race
> > >
> > >  INFO: task leakcatch_ibm.o:42313 blocked for more than 120 seconds.
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >  leakcatch_ibm.o D 00001000004382e4 0 42313 42297 0x00040080
> > >  Call Trace:
> > >  [c000001e3b107870] [c000001e3b1078d0] 0xc000001e3b1078d0 (unreliable)
> > >  [c000001e3b107a40] [c0000000000168dc] switch_to+0x25c/0x470
> > >  [c000001e3b107aa0] [c000000000996854] __schedule+0x414/0xad0
> > >  [c000001e3b107b80] [d000000023f278c4] mlx5_ib_free_implicit_mr+0xb4/0x110 [mlx5_ib]
> > >  [c000001e3b107bf0] [d000000023f1c5c8] mlx5_ib_invalidate_mr+0x178/0x1e0 [mlx5_ib]
> > >  [c000001e3b107c30] [d000000023f1c7d4] mlx5_ib_dereg_mr+0xe4/0x110 [mlx5_ib]
> > >  [c000001e3b107c70] [d000000023b414c4] ib_dereg_mr+0x34/0x70 [ib_core]
> > >  [c000001e3b107ca0] [d000000027de1fbc] ib_uverbs_cleanup_ucontext+0x4bc/0x6c0 [ib_uverbs]
> > >  [c000001e3b107d20] [d000000027de2204] ib_uverbs_close+0x44/0x140 [ib_uverbs]
> > >  [c000001e3b107d50] [c0000000003219e4] __fput+0xd4/0x2d0
> > >  [c000001e3b107db0] [c00000000010f6d4] task_work_run+0xe4/0x160
> > >  [c000001e3b107e00] [c000000000018c6c] do_notify_resume+0xcc/0x100
> > >  [c000001e3b107e30] [c00000000000a7b0] ret_from_except_lite+0x5c/0x60
> > >
> > > Fix this by performing steps 1 and 2 above atomically.
> > >
> > > Fixes: 882214e2b128 ("IB/core: Implement support for MMU notifiers regarding on demand paging regions")
> > > Signed-off-by: Moni Shoua <monis@xxxxxxxxxxxx>
> > > Reviewed-by: Artemy Kovalyov <artemyko@xxxxxxxxxxxx>
> > > Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > >  drivers/infiniband/core/umem_odp.c |  2 +-
> > >  drivers/infiniband/hw/mlx5/odp.c   | 10 ++++------
> > >  include/rdma/ib_umem_odp.h         |  2 +-
> > >  3 files changed, 6 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> > > index 0d154c5d12cc..679146352a3c 100644
> > > +++ b/drivers/infiniband/core/umem_odp.c
> > > @@ -113,7 +113,7 @@ static int ib_umem_notifier_release_trampoline(struct ib_umem_odp *umem_odp,
> > >      * prevent any further fault handling on this MR.
> > >      */
> > >     ib_umem_notifier_start_account(umem_odp);
> > > -   umem_odp->dying = 1;
> > > +   atomic_set(&umem_odp->dying, 1);
> > >     /* Make sure that the fact the umem is dying is out before we release
> > >      * all pending page faults. */
> > >     smp_wmb();
> >
> > Is the smp_wmb still needed if there is an atomic, or should it be
> > one of the special atomic mbs? Or do you not need a barrier at
> > all if the test/set bit API is used instead?
>
> From what I read, atomic doesn't prevent reordering and only
> provides a reliable one-shot read/write access. It looks like the *_wmb()
> call is still needed. Regarding the "special atomic mbs", which one did you
> have in mind?
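
(For reference, a minimal sketch of what the "special atomic mbs" presumably refers to: the kernel's smp_mb__{before,after}_atomic() helpers, which pair with non-value-returning RMW ops such as set_bit(). Whether they fit this code path is exactly the open question; treating 'dying' as an unsigned long bitmap below is an assumption, not what the patch does.)

        set_bit(0, &umem_odp->dying);   /* RMW bitop, provides no ordering by itself */
        smp_mb__after_atomic();         /* publish 'dying' before releasing pending page faults */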
>
> >
> > > diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> > > index fa22b28e8e43..c1d258b86145 100644
> > > +++ b/drivers/infiniband/hw/mlx5/odp.c
> > > @@ -98,7 +98,7 @@ static int check_parent(struct ib_umem_odp *odp,
> > >  {
> > >     struct mlx5_ib_mr *mr = odp->private;
> > >
> > > -   return mr && mr->parent == parent && !odp->dying;
> > > +   return mr && mr->parent == parent && !atomic_read(&odp->dying);
> > >  }
> >
> > Almost every time I see code trying to create an 'is_closed/dying'
> > construct without locks, it turns out to be racy and wrong.
>
> No argument, totally agree.
>
> >
> > Why is it OK for dying to become 1 the instant after check_parent is
> > called? If it is OK then why are we testing dying here?
>
> In theory, we are supposed to be protected by the "per_mm->umem_rwsem" rw semaphore,
> and it won't change without taking that lock.
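
(A hedged sketch of the protection being described, assuming the code paths that end up calling check_parent() run with that semaphore held; that premise comes from Leon's answer, not from anything this sketch verifies.)

        down_read(&per_mm->umem_rwsem);
        /* ... interval-tree walk whose callbacks (e.g. check_parent()) expect
         * 'dying' to stay stable while the semaphore is held ... */
        up_read(&per_mm->umem_rwsem);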
>
> >
> > > @@ -551,10 +550,9 @@ static int mr_leaf_free(struct ib_umem_odp *umem_odp, u64 start, u64 end,
> > >     ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem),
> > >                                 ib_umem_end(umem));
> > >
> > > -   if (umem_odp->dying)
> > > +   if (!atomic_add_unless(&umem_odp->dying, 1, 1))
> > >             return 0;
> >
> > Why use an atomic? Isn't this algorithm just test_and_set_bit() /
> > set_bit() and test_bit() ? They are atomics too.
>
> set_bit/test_bit operate on an unsigned long, while Moni uses an atomic_t
> variable, hence this change.
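
(For comparison, a hedged sketch of the bitop form being suggested; it assumes 'dying' were declared as an unsigned long, which is not what the patch does.)

        /* test_and_set_bit() atomically sets the bit and returns its old value. */
        if (test_and_set_bit(0, &umem_odp->dying))
                return 0;       /* the other flow already marked it dying */

        atomic_inc(&imr->num_leaf_free);        /* incremented by the winner only */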
>
> >
> > > @@ -682,7 +680,7 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
> > >
> > >  out:
> > >     if (ret == -EAGAIN) {
> > > -           if (implicit || !odp->dying) {
> > > +           if (implicit || !atomic_read(&odp->dying)) {
> > >                     unsigned long timeout =
> > >                             msecs_to_jiffies(MMU_NOTIFIER_TIMEOUT);
> >
> > Same question as check_parent
>
> Actually, at the end of my answer, I had the feeling that the root cause of the
> observed trace is a missed "per_mm->umem_rwsem" lock.
>
> Moni???
>
The purpose of this fix was to prevent a specific bad outcome:
num_leaf_free being incremented too many times, causing the
flow to wait forever in

        wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));

This could happen because there were 2 flows that looked like

----------------------------------- start critical section -----------------------------------
        if (umem_odp->dying)
                return 0;
                                                   <<< dying could change from 0 to 1 here
        WRITE_ONCE(umem_odp->dying, 1);
----------------------------------- end critical section -----------------------------------
        atomic_inc(&imr->num_leaf_free);

Replacing the critical section with atomic_add_unless() solves the problem, IMO.
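
(For illustration, a minimal sketch of the same section with the patch applied, using the names from the hunk in odp.c:)

        /* Atomic test-and-set: at most one of the two flows proceeds. */
        if (!atomic_add_unless(&umem_odp->dying, 1, 1))
                return 0;       /* already marked dying by the other flow */

        atomic_inc(&imr->num_leaf_free);        /* now incremented at most once per umem */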

I don't think that doing it under a lock is required here.

If there are other critical sections that aren't guarded yet, they need
to be fixed, but not necessarily in this patch.


