Re: [PATCH rdma-next] RDMA/addr: create addr_wq with WQ_MEM_RECLAIM flag

Potnuri Bharat Teja <bharat@xxxxxxxxxxx> · Wed, 13 Mar 2019 19:38:18 +0530

On Thursday, February 07, 2019 at 03:49:47 +0530, Steve Wise wrote:
> 
> On 2/6/2019 3:52 PM, Steve Wise wrote:
> > On 2/5/2019 4:39 PM, Jason Gunthorpe wrote:
> >> On Thu, Jan 31, 2019 at 11:30:42AM -0800, Steve Wise wrote:
> >>> While running NVMe/oF wire unplug tests, we hit this warning in
> >>> kernel/workqueue.c:check_flush_dependency():
> >>>
> >>> WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> >>> 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
> >>> 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing !WQ_MEM_RECLAIM %s:%pf",
> >>> 	  worker->current_pwq->wq->name, worker->current_func,
> >>> 	  target_wq->name, target_func);
> >>>
> >>> Which I think means we're flushing a workq that doesn't have
> >>> WQ_MEM_RECLAIM set, from workqueue context that does have it set.
> >>>
> >>> Looking at rdma_addr_cancel() which is doing the flushing, it flushes
> >>> the addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel()
> >>> is being called by the nvme host connection timeout/reconnect workqueue
> >>> thread that does have WQ_MEM_RECLAIM set.
> >> Since we haven't learned anything more, I think you should look to
> >> remove either the WQ_MEM_RECLAIM or the rdma_addr_cancel() from the
> >> nvme side.
> > The nvme code is just calling rdma_destroy_id(), which in turn calls
> > rdma_addr_cancel(),  so I'll have to remove WQ_MEM_RECLAIM from the
> > workqueue.
> >
> > I'll post this patch then.
> 
> 
> What a mess.  If I remove RECLAIM for nvme_wq, then I'll regress this
> change, I think:
> 
> c669ccdc50c2 ("nvme: queue ns scanning and async request from nvme_wq")
> 
Hi All,
I am hitting this same warning with some basic tests. It looks like the
discussion ended here without a resolution. Please suggest where this should
ideally be fixed, so that I can try to fix it.

Thanks,
Bharat