Re: [PATCH 0/2] bugfix for ipoib

Jason Gunthorpe <jgg@xxxxxxxx> · Mon, 20 Nov 2023 20:16:40 -0400

On Mon, Nov 20, 2023 at 09:34:59PM +0100, Jack Wang wrote:
> We run into queue timeout often with call trace as such:
> NETDEV WATCHDOG: ib0.beef (): transmit queue 26 timed out
> Call Trace:
> call_timer_fn+0x27/0x100
> __run_timers.part.0+0x1be/0x230
> ? mlx5_cq_tasklet_cb+0x6d/0x140 [mlx5_core]
> run_timer_softirq+0x26/0x50
> __do_softirq+0xbc/0x26d
> asm_call_irq_on_stack+0xf/0x20
> ib0.beef: transmit timeout: latency 10 msecs
> ib0.beef: queue stopped 0, tx_head 0, tx_tail 0, global_tx_head 0, global_tx_tail 0
> 
> The last two message repeated for days.

You shouldn't get tx timeouts and fully stuck queues like that, it
suggests something else is very wrong in that system.

> After cross check with Mellanox OFED, I noticed some bugfix are missing in
> upstream, hence I take the liberty to send them out.

Recovery is recovery, it is just RAS

Jason