Re: mlx5_ib_post_send stuck?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Mar 01, 2017 at 08:04:36PM -0700, Robert LeBlanc wrote:
> I've been trying to track down a hang in iSERT that requires rebooting
> our target and is very inconvenient. Our environment is very
> complicated so I've tried to reproduce the problem on a much simpler
> config. I believe I have a situation that triggers the same backtrace
> (it is hard to know 100% because our production is on 4.4 and
> Infiniband and my test is on 4.9.0 with RoCE, but they look very
> similar). I found that there is a bug in iSER (initiator) that
> triggers the target hang with the ConnectX-4-LX cards. This commit
> fixes the initiator side, but does not fix the target side, so I keep
> leveraging the bug to track down the target bug. I include it here in
> the case it adds useful information.
>
> commit 24be606cd3b433412488ce962842489fe6e34a7a
> Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
> Date:   Wed Jan 18 00:40:39 2017 +0200
>
>     IB/iser: Fix sg_tablesize calculation
>
>     commit 1e5db6c31ade4150c2e2b1a21e39f776c38fea39 upstream.
>
>     For devices that can register page list that is bigger than
>     USHRT_MAX, we actually take the wrong value for sg_tablesize.
>     E.g: for CX4 max_fast_reg_page_list_len is 65536 (bigger than USHRT_MAX)
>     so we set sg_tablesize to 0 by mistake. Therefore, each IO that is
>     bigger than 4k splitted to "< 4k" chunks that cause performance degredation.
>     Remove wrong sg_tablesize assignment, and use the value that was set during
>     address resolution handler with the needed casting.
>
>     Signed-off-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
>     Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
>     Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
>     Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>
> diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c
> b/drivers/infiniband/ulp/iser/iscsi_iser.c
> index 64b3d11..140f3f3 100644
> --- a/drivers/infiniband/ulp/iser/iscsi_iser.c
> +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
> @@ -651,13 +651,6 @@ static void iscsi_iser_cleanup_task(struct
> iscsi_task *task)
>                                                    SHOST_DIX_GUARD_CRC);
>                 }
>
> -               /*
> -                * Limit the sg_tablesize and max_sectors based on the device
> -                * max fastreg page list length.
> -                */
> -               shost->sg_tablesize = min_t(unsigned short, shost->sg_tablesize,
> -
> ib_conn->device->ib_device->attrs.max_fast_reg_page_list_len);
> -
>                 if (iscsi_host_add(shost,
>                                    ib_conn->device->ib_device->dma_device)) {
>                         mutex_unlock(&iser_conn->state_mutex);
>
> Based on the backtraces produced by the target, it looks like there is
> something in mlx5_ib_post_send that may be wonky such that the send Q
> can't be drained.
>
> [  369.794100] INFO: task iscsi_np:6998 blocked for more than 120 seconds.
> [  369.794891]       Not tainted 4.9.0+ #5
> [  369.795689] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  369.796525] iscsi_np        D    0  6998      2 0x00000084
> [  369.797379]  ffff887f5e275d00 0000000000000000 ffff887f66650000
> ffff887f7f219300
> [  369.798275]  ffff887f3c691600 ffffc9003293fbf0 ffffffff816d86d5
> ffff88807ffdba00
> [  369.799190]  0000000200000000 0000000000000000 ffff887f3c691600
> 7fffffffffffffff
> [  369.800120] Call Trace:
> [  369.801041]  [<ffffffff816d86d5>] ? __schedule+0x195/0x630
> [  369.801977]  [<ffffffff816d8ba6>] schedule+0x36/0x80
> [  369.802921]  [<ffffffff816dbdbc>] schedule_timeout+0x21c/0x3a0
> [  369.803877]  [<ffffffff8135fcbe>] ? vsnprintf+0x34e/0x4d0
> [  369.804833]  [<ffffffff816d9682>] wait_for_completion+0xf2/0x130
> [  369.805760]  [<ffffffff810b84b0>] ? wake_up_q+0x80/0x80
> [  369.806666]  [<ffffffffa07a1701>] iscsit_stop_session+0x1b1/0x1c0
> [iscsi_target_mod]
> [  369.807600]  [<ffffffffa0792e72>]
> iscsi_check_for_session_reinstatement+0x1e2/0x270 [iscsi_target_mod]
> [  369.808572]  [<ffffffffa0795870>]
> iscsi_target_check_for_existing_instances+0x30/0x40 [iscsi_target_mod]
> [  369.809562]  [<ffffffffa07959b8>] iscsi_target_do_login+0x138/0x630
> [iscsi_target_mod]
> [  369.810562]  [<ffffffffa0796bce>]
> iscsi_target_start_negotiation+0x4e/0xa0 [iscsi_target_mod]
> [  369.811577]  [<ffffffffa079476e>]
> __iscsi_target_login_thread+0x83e/0xf20 [iscsi_target_mod]
> [  369.812597]  [<ffffffff816d8601>] ? __schedule+0xc1/0x630
> [  369.813639]  [<ffffffffa0794e50>] ?
> __iscsi_target_login_thread+0xf20/0xf20 [iscsi_target_mod]
> [  369.814722]  [<ffffffffa0794e74>]
> iscsi_target_login_thread+0x24/0x30 [iscsi_target_mod]
> [  369.815819]  [<ffffffff810abbb9>] kthread+0xd9/0xf0
> [  369.816861]  [<ffffffff810abae0>] ? kthread_park+0x60/0x60
> [  369.817909]  [<ffffffff816dd4d5>] ret_from_fork+0x25/0x30
> [  369.819118] INFO: task iscsi_trx:8458 blocked for more than 120 seconds.
> [  369.820435]       Not tainted 4.9.0+ #5
> [  369.821754] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  369.823135] iscsi_trx       D    0  8458      2 0x00000084
> [  369.824533]  ffff887f5e724d80 0000000000000000 ffff887f66659600
> ffff887f7f399300
> [  369.825984]  ffff883f5f7a9600 ffffc90035687b70 ffffffff816d86d5
> 0000000000000000
> [  369.827189]  ffff887f65e12400 ffffc90035687b50 ffff883f5f7a9600
> 7fffffffffffffff
> [  369.828283] Call Trace:
> [  369.829369]  [<ffffffff816d86d5>] ? __schedule+0x195/0x630
> [  369.830475]  [<ffffffff816d8ba6>] schedule+0x36/0x80
> [  369.831575]  [<ffffffff816dbdbc>] schedule_timeout+0x21c/0x3a0
> [  369.832698]  [<ffffffffa05b9e58>] ? mlx5_ib_post_send+0xc8/0x1750 [mlx5_ib]
> [  369.833819]  [<ffffffff816d9682>] wait_for_completion+0xf2/0x130
> [  369.834956]  [<ffffffff810b84b0>] ? wake_up_q+0x80/0x80
> [  369.836130]  [<ffffffffa04c7b10>] __ib_drain_sq+0x190/0x1c0 [ib_core]
> [  369.837234]  [<ffffffffa04c7960>] ? ib_sg_to_pages+0x1a0/0x1a0 [ib_core]
> [  369.838342]  [<ffffffffa04c7b65>] ib_drain_sq+0x25/0x30 [ib_core]
> [  369.839435]  [<ffffffffa04c7d72>] ib_drain_qp+0x12/0x30 [ib_core]
> [  369.840519]  [<ffffffffa049e5ff>] isert_wait_conn+0x5f/0x2d0 [ib_isert]
> [  369.841621]  [<ffffffffa07a1be7>]
> iscsit_close_connection+0x157/0x860 [iscsi_target_mod]
> [  369.842747]  [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180
> [  369.843877]  [<ffffffffa079031b>]
> iscsit_take_action_for_connection_exit+0x7b/0xf0 [iscsi_target_mod]
> [  369.845037]  [<ffffffffa07a1495>] iscsi_target_rx_thread+0x95/0xa0
> [iscsi_target_mod]
> [  369.846189]  [<ffffffffa07a1400>] ?
> iscsi_target_tx_thread+0x1d0/0x1d0 [iscsi_target_mod]
> [  369.847276]  [<ffffffff810abbb9>] kthread+0xd9/0xf0
> [  369.848333]  [<ffffffff810abae0>] ? kthread_park+0x60/0x60
> [  369.849391]  [<ffffffff816dd4d5>] ret_from_fork+0x25/0x30
>
> So I added some debug statements in the mlx5 driver.
>
> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> index d1e9218..ce5180e 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -3812,6 +3812,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct
> ib_send_wr *wr,
>         u8 next_fence = 0;
>         u8 fence;
>
> +       printk("Starting mlx5_ib_post_send.\n");
>         if (unlikely(ibqp->qp_type == IB_QPT_GSI))
>                 return mlx5_ib_gsi_post_send(ibqp, wr, bad_wr);
>
> @@ -3828,7 +3829,9 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct
> ib_send_wr *wr,
>                 goto out;
>         }
>
> +       printk("mips: start iterating through wr.\n");
>         for (nreq = 0; wr; nreq++, wr = wr->next) {
> +               printk("mips: nreq=%d\n", nreq);
>                 if (unlikely(wr->opcode >= ARRAY_SIZE(mlx5_ib_opcode))) {
>                         mlx5_ib_warn(dev, "\n");
>                         err = -EINVAL;
> @@ -3847,12 +3850,14 @@ int mlx5_ib_post_send(struct ib_qp *ibqp,
> struct ib_send_wr *wr,
>
>                 err = begin_wqe(qp, &seg, &ctrl, wr, &idx, &size, nreq);
>                 if (err) {
> +                       printk("mips: begin_wqe error: %d\n", err);
>                         mlx5_ib_warn(dev, "\n");
>                         err = -ENOMEM;
>                         *bad_wr = wr;
>                         goto out;
>                 }
>
> +               printk("mips: Starting qp_type switch on %d\n", ibqp->qp_type);
>                 switch (ibqp->qp_type) {
>                 case IB_QPT_XRC_INI:
>                         xrc = seg;
> @@ -3994,6 +3999,8 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct
> ib_send_wr *wr,
>                                 seg = mlx5_get_send_wqe(qp, 0);
>                         break;
>                 case IB_QPT_UD:
> +                       printk("mips: Starting IB_QPT_UD case.\n");
> +                       dump_wqe(qp, idx, size);
>                         set_datagram_seg(seg, wr);
>                         seg += sizeof(struct mlx5_wqe_datagram_seg);
>                         size += sizeof(struct mlx5_wqe_datagram_seg) / 16;
> @@ -4015,6 +4022,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct
> ib_send_wr *wr,
>                                 if (unlikely((seg == qend)))
>                                         seg = mlx5_get_send_wqe(qp, 0);
>                         }
> +                       printk("mips: Finished IB_QPT_UD case.\n");
>                         break;
>                 case MLX5_IB_QPT_REG_UMR:
>                         if (wr->opcode != MLX5_IB_WR_UMR) {
> @@ -4070,11 +4078,12 @@ int mlx5_ib_post_send(struct ib_qp *ibqp,
> struct ib_send_wr *wr,
>                            get_fence(fence, wr), next_fence,
>                            mlx5_ib_opcode[wr->opcode]);
>  skip_psv:
> -               if (0)
> +               if (1)
>                         dump_wqe(qp, idx, size);
>         }
>
>  out:
> +       printk("mips: out.\n");
>         if (likely(nreq)) {
>                 qp->sq.head += nreq;
>
> @@ -4115,6 +4124,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct
> ib_send_wr *wr,
>
>         spin_unlock_irqrestore(&qp->sq.lock, flags);
>
> +       printk("Finished mlx5_ib_post_send.\n");
>         return err;
>  }
>
> And running `echo "func dump_wqe +p" >
> /sys/kernel/debug/dynamic_debug/control` gives:
>
> [  170.268608] Starting mlx5_ib_post_send.
> [  170.268609] mips: start iterating through wr.
> [  170.268609] mips: nreq=0
> [  170.268610] mips: Starting qp_type switch on 2
> [  170.268611] dump wqe at ffff887f30fa78c0
> [  170.268612] 0089a30a 00011b02 00000008 00000000
> [  170.268613] 0000004c 00183b51 0000007f 30f53858
> [  170.268613] mips: out.
> [  170.268614] Finished mlx5_ib_post_send.
> [  170.268615] Starting mlx5_ib_post_send.
> [  170.268615] mips: start iterating through wr.
> [  170.268616] mips: nreq=0
> [  170.268616] mips: Starting qp_type switch on 2
> [  170.268617] dump wqe at ffff887f30fa7900
> [  170.268618] 0089a40a 00011b02 00000008 00000000
> [  170.268619] 0000004c 00183b51 0000007f 30f5b700
> [  170.268619] mips: out.
> [  170.268620] Finished mlx5_ib_post_send.
> [  170.268621] Starting mlx5_ib_post_send.
> [  170.268621] mips: start iterating through wr.
> [  170.268622] mips: nreq=0
> [  170.268623] mips: Starting qp_type switch on 2
> [  170.268623] dump wqe at ffff887f30fa7940
> [  170.268624] 0089a50a 00011b02 00000008 00000000
> [  170.268625] 0000004c 00183b51 0000007f 30f52038
> [  170.268626] mips: out.
> [  170.268626] Finished mlx5_ib_post_send.
> [  171.175601] Starting mlx5_ib_post_send.
> [  171.176342] mips: start iterating through wr.
> [  171.176620] systemd-journald[871]: /dev/kmsg buffer overrun, some
> messages lost.
> [  171.177814] mips: nreq=0
> [  171.178628] mips: Starting qp_type switch on 2
> [  171.179427] dump wqe at ffff887f30fa7980
> [  171.179429] 0089a60a 00011b02 00000008 00000000
> [  171.179431] 0000004c 00183b51 0000007f 30f4eff8
> [  171.179432] mips: out.
> [  171.180203] Finished mlx5_ib_post_send.
> [  181.674314] Starting mlx5_ib_post_send.
> [  181.675136] Starting mlx5_ib_post_send.
> [  181.675926] mips: start iterating through wr.
> [  181.676721] mips: nreq=0
> [  181.677509] mips: Starting qp_type switch on 4
> [  181.678323] mips: Starting IB_QPT_UD case.
> [  181.679133] dump wqe at ffff887f4ac04180
> [  181.679135] 00000000 00000000 00000008 00000000
> [  181.679136] mips: Finished IB_QPT_UD case.
> [  181.679933] dump wqe at ffff887f4ac04180
> [  181.679934] 0000060a 00009706 00000008 00000000
> [  181.679936] 80000000 00000000 80000001 30000000
> [  181.679937] 00000000 0cc47a88 07310040 40200000
> [  181.679939] 00000000 00000000 0000ffff c0a80b0e
> [  181.679940] 00000018 00001202 0000007f 5d9f8000
> [  181.679941] 000000e8 00001202 0000007f 5d9f8018
> [  181.679942] mips: out.
> [  181.680728] Finished mlx5_ib_post_send.
> [  181.681588] Starting mlx5_ib_post_send.
> [  181.682390] mips: start iterating through wr.
> [  181.683197] mips: nreq=0
> [  181.683997] mips: Starting qp_type switch on 2
> [  181.684807] dump wqe at ffff887f30fa79c0
> [  181.684809] 0089a708 00011b02 00000000 00000000
> [  181.684810] 00000000 00000006 00000000 00000000
> [  181.684811] mips: out.
> [  181.685626] Finished mlx5_ib_post_send.
> [  183.733919] Starting mlx5_ib_post_send.
> [  183.734734] Starting mlx5_ib_post_send.
> [  183.735530] mips: start iterating through wr.
> [  183.736329] mips: nreq=0
> [  183.737122] mips: Starting qp_type switch on 4
> [  183.737926] mips: Starting IB_QPT_UD case.
> [  183.738724] dump wqe at ffff887f4ac04200
> [  183.738726] 00000000 00000000 00000008 00000000
> [  183.738727] mips: Finished IB_QPT_UD case.
> [  183.739517] dump wqe at ffff887f4ac04200
> [  183.739519] 0000080a 00009706 00000008 00000000
> [  183.739521] 80000000 00000000 80000001 30000000
> [  183.739522] 00000000 0cc47a88 07310040 40200000
> [  183.739523] 00000000 00000000 0000ffff c0a80b0e
> [  183.739525] 00000018 00001202 0000007f 653ddc00
> [  183.739526] 000000e8 00001202 0000007f 653ddc18
> [  183.739527] mips: out.
> [  183.740315] Finished mlx5_ib_post_send.
> [  198.786376] iSCSI Login timeout on Network Portal 0.0.0.0:3260
> [  200.962188] Starting mlx5_ib_post_send.
> [  200.963030] Starting mlx5_ib_post_send.
> [  200.963845] mips: start iterating through wr.
> [  200.964672] mips: nreq=0
> [  200.965507] mips: Starting qp_type switch on 4
> [  200.966338] mips: Starting IB_QPT_UD case.
> [  200.967171] dump wqe at ffff887f4ac04280
> [  200.967173] 00000000 00000000 00000008 00000000
> [  200.967174] mips: Finished IB_QPT_UD case.
> [  200.967995] dump wqe at ffff887f4ac04280
> [  200.967997] 00000a0a 00009706 00000008 00000000
> [  200.967998] 80000000 00000000 80000001 30000000
> [  200.968000] 00000000 0cc47a88 07310040 40200000
> [  200.968001] 00000000 00000000 0000ffff c0a80b0e
> [  200.968003] 00000018 00001202 0000007f 653dd400
> [  200.968004] 000000e8 00001202 0000007f 653dd418
> [  200.968005] mips: out.
> [  200.968815] Finished mlx5_ib_post_send.
>
> During the test it seems that for the most part wqe stays pretty
> small, but when the 'issue' starts to happen (at 181) there is more
> stuff there. The interesting thing is that by adding these statements,
> the test is able to run longer than without them. It usually hangs
> within the first 1 GB, but with this patch, it goes for a few GB
> before hanging.
>
> Sagi has been very helpful so far, but he thinks there might be
> something in the driver preventing the queue from being drained. I
> don't understand the output to know what it means. Based on what I've
> seen during the disconnection of iSER, it seemed that qp_type 4 is
> what was hanging, but the problem may have been on a different type
> before then. I tried dumping wqe for each op, but it would cause the
> box to be unresponsive.
>
> I'd appreciate any ideas on how to troubleshoot this.

If I read your dumped WQE correctly, you have:
 * local_qp_operation error
 * RDE_WQE_FETCH_TPT - DP violation - error in fetch WQE in RX in PD.

Max,
Am I right?

>
> Thank you,
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux