On Sat, 2022-08-20 at 21:55 +0000, Chuck Lever III wrote:
> Hi-
>
> This warning just popped on a stuck NFS/RDMA mount (the Ethernet switch
> port VLAN settings were not correct):
>
> Aug 20 17:12:05 bazille.1015granger.net kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
> Aug 20 17:12:05 bazille.1015granger.net kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
>
> Aug 20 17:12:05 bazille.1015granger.net kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
>
> Aug 20 17:12:05 bazille.1015granger.net kernel: Call Trace:
> Aug 20 17:12:05 bazille.1015granger.net kernel:  <TASK>
> Aug 20 17:12:05 bazille.1015granger.net kernel:  __flush_work.isra.0+0xaf/0x188
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ? lock_timer_base+0x38/0x5f
> Aug 20 17:12:05 bazille.1015granger.net kernel:  __cancel_work_timer+0xea/0x13d
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ? preempt_latency_start+0x2b/0x46
> Aug 20 17:12:05 bazille.1015granger.net kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
> Aug 20 17:12:05 bazille.1015granger.net kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
> Aug 20 17:12:05 bazille.1015granger.net kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ? _raw_spin_unlock+0x14/0x29
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ? finish_task_switch.isra.0+0x171/0x249
> Aug 20 17:12:05 bazille.1015granger.net kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
> Aug 20 17:12:05 bazille.1015granger.net kernel:  process_one_work+0x1d8/0x2d4
> Aug 20 17:12:05 bazille.1015granger.net kernel:  worker_thread+0x18b/0x24f
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ? rescuer_thread+0x280/0x280
> Aug 20 17:12:05 bazille.1015granger.net kernel:  kthread+0xf4/0xfc
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ? kthread_complete_and_exit+0x1b/0x1b
> Aug 20 17:12:05 bazille.1015granger.net kernel:  ret_from_fork+0x22/0x30
> Aug 20 17:12:05 bazille.1015granger.net kernel:  </TASK>
>
> At a guess, the recent changes to the WQ_MEM_RECLAIM settings in the
> RPC xprt code did not get carried over to rpcrdma... ? Need some
> guidance please, and I can write and test a fix for this.
>

Looks like you're trying to cancel work on a non-memory reclaim
workqueue from a job running on a queue that is flagged as a memory
reclaim workqueue. That's a priority inversion problem.

Basically, you need to change

int addr_init(void)
{
	addr_wq = alloc_ordered_workqueue("ib_addr", 0);
	if (!addr_wq)
		return -ENOMEM;

	register_netevent_notifier(&nb);

	return 0;
}

and flag addr_wq as being a WQ_MEM_RECLAIM queue.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
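
To illustrate the rule behind the warning quoted above, here is a minimal
userspace sketch. It is not kernel code: the names model_wq,
MODEL_WQ_MEM_RECLAIM and model_flush_would_warn are hypothetical and exist
only for this illustration, and the kernel's real check_flush_dependency()
applies further conditions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag for this model only; it stands in for the
 * kernel's WQ_MEM_RECLAIM and does not reuse its value. */
#define MODEL_WQ_MEM_RECLAIM (1u << 0)

/* Hypothetical, stripped-down stand-in for a workqueue. */
struct model_wq {
	const char *name;
	unsigned int flags;
};

/*
 * Returns true when a flush (or cancel that has to wait) would trip
 * the dependency warning in this model: the flusher runs on a queue
 * flagged for memory reclaim, but the target queue is not flagged.
 *
 * flusher == NULL models a caller that is not a workqueue worker.
 */
bool model_flush_would_warn(const struct model_wq *flusher,
			    const struct model_wq *target)
{
	assert(target != NULL);

	/* A reclaim-flagged target can always make forward progress. */
	if ((target->flags & MODEL_WQ_MEM_RECLAIM) != 0)
		return false;

	/* Only a reclaim-flagged flusher creates the inversion. */
	return flusher != NULL &&
	       (flusher->flags & MODEL_WQ_MEM_RECLAIM) != 0;
}
```

In this model, a flusher named "xprtiod" with the reclaim flag set and a
target named "ib_addr" created with flags 0 make model_flush_would_warn()
return true; once "ib_addr" also carries the reclaim flag it returns false.
In the kernel, the suggested change to addr_init() would presumably come
down to passing WQ_MEM_RECLAIM instead of 0 to alloc_ordered_workqueue();
that is an untested reading of the advice above, not a reviewed patch.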