On Mon, Nov 20, 2023 at 09:34:59PM +0100, Jack Wang wrote:
> We run into queue timeout often with call trace as such:
> NETDEV WATCHDOG: ib0.beef (): transmit queue 26 timed out
> Call Trace:
> call_timer_fn+0x27/0x100
> __run_timers.part.0+0x1be/0x230
> ? mlx5_cq_tasklet_cb+0x6d/0x140 [mlx5_core]
> run_timer_softirq+0x26/0x50
> __do_softirq+0xbc/0x26d
> asm_call_irq_on_stack+0xf/0x20
> ib0.beef: transmit timeout: latency 10 msecs
> ib0.beef: queue stopped 0, tx_head 0, tx_tail 0, global_tx_head 0, global_tx_tail 0
>
> The last two message repeated for days.

You shouldn't get tx timeouts and fully stuck queues like that, it
suggests something else is very wrong in that system.

> After cross check with Mellanox OFED, I noticed some bugfix are missing in
> upstream, hence I take the liberty to send them out.

Recovery is recovery, it is just RAS

Jason