Re: Hang in ipoib_mcast_stop_thread

On Mon, Jun 6, 2016 at 3:11 PM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>
>
> On 06/06/2016 02:51 PM, Erez Shitrit wrote:
>> Hi Nikolay
>>
>> What was the scenario? (Do you have a way to reproduce that?)
>
> I have no way of reliably reproducing that, but I have experienced
> misteryous hangs in the ipoib stack e.g.
>
> https://marc.info/?l=linux-rdma&m=145915284709901

I am not sure this is the same case; here it might be due to a missing
event, or a CM issue.
(Anyway, missing events in the HW could explain both cases.)

>
> when the connection to the infiniband network is lost. Unfortunately I
> cannot even figure out where to begin debugging this ;(
>
> As a matter of fact I currently have one server which complains:
>
> ib0: transmit timeout: latency 139718550 msecs
> ib0: queue stopped 1, tx_head 242, tx_tail 114
>
> yet, "iblinkinfo" can show all the nodes in the infiniband network.
>
>> Which IB card are you using?
>
> ibstat
> CA 'qib0'
>         CA type: InfiniPath_QLE7342
>         Number of ports: 2
>         Firmware version:
>         Hardware version: 2
>         Node GUID: 0x001175000077b918
>         System image GUID: 0x001175000077b918
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 68
>                 LMC: 0
>                 SM lid: 61
>                 Capability mask: 0x07610868
>                 Port GUID: 0x001175000077b918
>                 Link layer: InfiniBand
>
>
>
>>
>> Thanks, Erez
>>
>> On Mon, Jun 6, 2016 at 12:09 PM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>>> Hello,
>>>
>>> [Tejun, I have cc'ed you since I'm not familiar with the internals of
>>> the workqueue code and cannot tell whether the mcast_task was queued,
>>> or whether there were any tasks in the workqueue, based on the
>>> structures I've presented below; could you please comment on that?]
>>>
>>> I've been running an infiniband network on 4.4.10 and recently, due to
>>> the infiniband failing (I presume), I got the following in my logs:
>>>
>>> [128340.462147] ib0: transmit timeout: latency 45014865 msecs
>>> [128340.462151] ib0: queue stopped 1, tx_head 711, tx_tail 583
>>> [128341.461122] ib0: transmit timeout: latency 45015864 msecs
>>> [128341.461126] ib0: queue stopped 1, tx_head 711, tx_tail 583
>>> [128342.461053] ib0: transmit timeout: latency 45016864 msecs
>>> [128342.461056] ib0: queue stopped 1, tx_head 711, tx_tail 583

This is the tx_timeout callback from the IPoIB driver. It indicates
that there are packets (skb's) that have been held in the driver for
longer than the watchdog time (probably 5 sec) while the tx queue is
stopped by the driver.
It can happen when the HW is under stress and fails to drain the tx queue.
It can happen, but it should be rare (it depends on your specific HW).
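
For context, here is a rough sketch of how that watchdog and the
callback fit together in a generic netdev driver. This only illustrates
the mechanism; it is not the IPoIB code, and the example_* names and
the comments are made up for the sketch:

#include <linux/netdevice.h>

/* The driver sets dev->watchdog_timeo and provides .ndo_tx_timeout.
 * The core tx watchdog invokes the callback when the tx queue has been
 * stopped for longer than watchdog_timeo without making progress.
 */
static void example_tx_timeout(struct net_device *dev)  /* 4.4-era signature */
{
        /* A driver would typically log the queue state and/or try to
         * recover the hardware queue here. */
        netdev_warn(dev, "tx timeout, queue stopped: %d\n",
                    netif_queue_stopped(dev));
}

static const struct net_device_ops example_netdev_ops = {
        .ndo_tx_timeout = example_tx_timeout,
        /* ... other ops ... */
};

static void example_setup(struct net_device *dev)
{
        dev->netdev_ops = &example_netdev_ops;
        dev->watchdog_timeo = 5 * HZ;   /* the "5 sec" mentioned above */
}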

>>>
>>> I had various commands hang due to rtnl_lock being held by an ip
>>> command that was trying to shut down the infiniband interface:
>>>
>>> PID: 30572 TASK: ffff881bb6bf6e00 CPU: 15 COMMAND: "ip"
>>> #0 [ffff88027f50b390] __schedule at ffffffff81631421
>>> #1 [ffff88027f50b430] schedule at ffffffff81631cc7
>>> #2 [ffff88027f50b450] schedule_timeout at ffffffff81634c89
>>> #3 [ffff88027f50b4e0] wait_for_completion at ffffffff81632d73
>>> #4 [ffff88027f50b550] flush_workqueue at ffffffff8106c761
>>> #5 [ffff88027f50b620] ipoib_mcast_stop_thread at ffffffffa02e0bf2 [ib_ipoib]
>>> #6 [ffff88027f50b650] ipoib_ib_dev_down at ffffffffa02de7d6 [ib_ipoib]
>>> #7 [ffff88027f50b670] ipoib_stop at ffffffffa02dc208 [ib_ipoib]
>>> #8 [ffff88027f50b6a0] __dev_close_many at ffffffff81562692
>>> #9 [ffff88027f50b6c0] __dev_close at ffffffff81562716
>>> #10 [ffff88027f50b6f0] __dev_change_flags at ffffffff815630fc
>>> #11 [ffff88027f50b730] dev_change_flags at ffffffff81563207
>>> #12 [ffff88027f50b760] do_setlink at ffffffff81576f5f
>>> #13 [ffff88027f50b860] rtnl_newlink at ffffffff81578afb
>>> #14 [ffff88027f50bb10] rtnetlink_rcv_msg at ffffffff81577ae5
>>> #15 [ffff88027f50bb90] netlink_rcv_skb at ffffffff8159a373
>>> #16 [ffff88027f50bbc0] rtnetlink_rcv at ffffffff81577bb5
>>> #17 [ffff88027f50bbe0] netlink_unicast at ffffffff81599e98
>>> #18 [ffff88027f50bc40] netlink_sendmsg at ffffffff8159aabe
>>> #19 [ffff88027f50bd00] sock_sendmsg at ffffffff81548ce9
>>> #20 [ffff88027f50bd20] ___sys_sendmsg at ffffffff8154ae78
>>> #21 [ffff88027f50beb0] __sys_sendmsg at ffffffff8154b059
>>> #22 [ffff88027f50bf40] sys_sendmsg at ffffffff8154b0a9
>>> #23 [ffff88027f50bf50] entry_SYSCALL_64_fastpath at ffffffff81635c57
>>>
>>>
>>> So clearly ipoib_mcast_stop_thread has hung trying to stop the
>>> multicast thread. Here is the state of the ipoib_wq:
>>>
>>> struct workqueue_struct {
>>>   pwqs = {
>>>     next = 0xffff883ff1196770,
>>>     prev = 0xffff883ff1196770
>>>   },
>>>   list = {
>>>     next = 0xffff883fef5bea10,
>>>     prev = 0xffff883fef201010
>>>   },
>>>   mutex = {
>>>     count = {
>>>       counter = 1
>>>     },
>>>     wait_lock = {
>>>       {
>>>         rlock = {
>>>           raw_lock = {
>>>             val = {
>>>               counter = 0
>>>             }
>>>           }
>>>         }
>>>       }
>>>     },
>>>     wait_list = {
>>>       next = 0xffff883fef200c28,
>>>       prev = 0xffff883fef200c28
>>>     },
>>>     owner = 0x0,
>>>     osq = {
>>>       tail = {
>>>         counter = 0
>>>       }
>>>     }
>>>   },
>>>   work_color = 1,
>>>   flush_color = 0,
>>>   nr_pwqs_to_flush = {
>>>     counter = 1
>>>   },
>>>   first_flusher = 0xffff88027f50b568,
>>>   flusher_queue = {
>>>     next = 0xffff883fef200c60,
>>>     prev = 0xffff883fef200c60
>>>   },
>>>   flusher_overflow = {
>>>     next = 0xffff883fef200c70,
>>>     prev = 0xffff883fef200c70
>>>   },
>>>   maydays = {
>>>     next = 0xffff883fef200c80,
>>>     prev = 0xffff883fef200c80
>>>   },
>>>   rescuer = 0xffff883feb073680,
>>>   nr_drainers = 0,
>>>   saved_max_active = 1,
>>>   unbound_attrs = 0xffff883fef2b5c20,
>>>   dfl_pwq = 0xffff883ff1196700,
>>>   wq_dev = 0x0,
>>>   name =
>>> "ipoib_wq\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
>>>   rcu = {
>>>     next = 0x0,
>>>     func = 0x0
>>>   },
>>>   flags = 131082,
>>>   cpu_pwqs = 0x0,
>>>
>>> Also the state of the mcast_task member:
>>>
>>> crash> struct ipoib_dev_priv.mcast_task ffff883fed6ec700
>>>   mcast_task = {
>>>     work = {
>>>       data = {
>>>         counter = 3072
>>>       },
>>>       entry = {
>>>         next = 0xffff883fed6ec888,
>>>         prev = 0xffff883fed6ec888
>>>       },
>>>       func = 0xffffffffa02e1220 <ipoib_mcast_join_task>
>>>     },
>>>     timer = {
>>>       entry = {
>>>         next = 0x0,
>>>         pprev = 0x0
>>>       },
>>>       expires = 0,
>>>       function = 0xffffffff8106dd20 <delayed_work_timer_fn>,
>>>       data = 18446612406880618624,
>>>       flags = 2097164,
>>>       slack = -1
>>>     },
>>>     wq = 0x0,
>>>     cpu = 0
>>>   }
>>>
>>>
>>> flush_workqueue is essentially waiting on
>>> wait_for_completion(&this_flusher.done), which apparently never
>>> returned. I'm assuming this is due to the general unavailability of
>>> the infiniband network. However, I think it's wrong for the whole
>>> server to be rendered unresponsive, due to rtnl_lock being held, just
>>> because infiniband is down. Do you think it is possible to rework
>>> this code so that it doesn't hang when the workqueue cannot be
>>> flushed? Furthermore, do you think it's feasible to put code in
>>> ipoib_mcast_join_task to detect such situations and not re-arm
>>> itself, and then use cancel_delayed_work_sync ?
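
To illustrate the idea in that last paragraph: a rough sketch of a join
task that refuses to re-arm once a "running" flag is cleared, with a
stop path that uses cancel_delayed_work_sync() instead of flushing the
whole workqueue. The struct, flag and function names below are made up
for the sketch and are not taken from the 4.4 IPoIB sources:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/bitops.h>
#include <linux/jiffies.h>

struct example_priv {
        struct delayed_work     mcast_task;
        struct workqueue_struct *wq;
        unsigned long           flags;
};

#define EXAMPLE_MCAST_RUN       0       /* hypothetical flag bit */

static void example_mcast_join_task(struct work_struct *work)
{
        struct example_priv *priv =
                container_of(work, struct example_priv, mcast_task.work);

        if (!test_bit(EXAMPLE_MCAST_RUN, &priv->flags))
                return;         /* interface is going down: don't re-arm */

        /* ... attempt the multicast joins here ... */

        /* re-arm only while we are still supposed to be running */
        if (test_bit(EXAMPLE_MCAST_RUN, &priv->flags))
                queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
}

static void example_mcast_stop(struct example_priv *priv)
{
        clear_bit(EXAMPLE_MCAST_RUN, &priv->flags);
        /* Waits for a running instance to finish and prevents it from
         * being re-queued, without blocking on every work item in the
         * workqueue the way flush_workqueue() does. */
        cancel_delayed_work_sync(&priv->mcast_task);
}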