RE: [PATCH for-rc 1/5] IB/hfi1: Fix WQ_MEM_RECLAIM warning

On May 6, 2019, at 4:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> On May 6, 2019, at 4:08 PM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>> On Mon, May 06, 2019 at 10:03:56PM +0300, Leon Romanovsky wrote:
>>> On Mon, May 06, 2019 at 03:16:10PM -0300, Jason Gunthorpe wrote:
>>>> On Mon, May 06, 2019 at 05:52:48PM +0000, Marciniszyn, Mike wrote:
>>>>> Don't lose sight of the fact that the addition of
>>>>> WQ_MEM_RECLAIM is to silence a warning BECAUSE ipoib's workqueue is
>>>>> WQ_MEM_RECLAIM.  This happens while
>>>>> rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by
>>>>> the QP's send engine
>>>>
>>>> Well, it is doing unsafe memory allocations and other stuff, so it
>>>> can't be RECLAIM. We should just delete them from IPoIB like Doug says.
>>>
>>> Please don't.
>>
>> Well then fix the broken allocations it does, and I don't really see
>> how to do that. We can't have it both ways.
>>
>> I would rather have NFS be broken than normal systems with ipoib
>> hanging during reclaim.
>
> TBH, NFS on IPoIB is a hack that I would be happy to see replaced with
> NFS/RDMA. However, I don't think we can have it regressing at this point -
> - and it sounds like there are other use cases that do depend on
> WQ_MEM_RECLAIM remaining in the IPoIB path.

I don't have an opinion on whether it is better to specify WQ_MEM_RECLAIM or not, but we are hitting a regression in the hfi1 driver when it is set. I have been debugging an issue that started appearing under certain high-stress conditions after we added the WQ_MEM_RECLAIM attribute to our workqueue.
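For concreteness, the pattern in question is roughly the following; the names here are made up for illustration and are not the actual hfi1/rdmavt code:

	#include <linux/module.h>
	#include <linux/workqueue.h>

	/* hypothetical stand-in for the rdmavt/hfi1 send-engine workqueue */
	static struct workqueue_struct *send_wq;

	static int __init example_init(void)
	{
		/* WQ_MEM_RECLAIM is the attribute that was added */
		send_wq = alloc_workqueue("example_send_wq", WQ_MEM_RECLAIM, 0);
		if (!send_wq)
			return -ENOMEM;
		return 0;
	}

	static void __exit example_exit(void)
	{
		/*
		 * The send engine's work item is cancelled with
		 * cancel_work_sync() before this point, so the queue should
		 * be empty here, yet this call now trips the sanity checks
		 * described below.
		 */
		destroy_workqueue(send_wq);
	}

	module_init(example_init);
	module_exit(example_exit);
	MODULE_LICENSE("GPL");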

If we're going to propagate WQ_MEM_RECLAIM, we need to understand this issue, especially if it turns out to be a workqueue bug or restriction.

Somewhat-gory WQ-specific details:

The error is detected by the "sanity checks" in destroy_workqueue(), which we call when unloading our driver. A stack dump is reported via WARN_ON() and the function (which has no return code) exits without destroying the queue, which causes additional errors later. The specific check that fails is the one on the pwq reference count:

	WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1))
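
For context, the surrounding logic in destroy_workqueue() looks roughly like this (paraphrased from my reading of kernel/workqueue.c, so details may differ slightly from the exact source):

	/* paraphrase of destroy_workqueue(), not exact source */
	drain_workqueue(wq);		/* the queue is drained first */

	mutex_lock(&wq->mutex);
	for_each_pwq(pwq, wq) {
		if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
		    WARN_ON(pwq->nr_active) ||
		    WARN_ON(!list_empty(&pwq->delayed_works))) {
			mutex_unlock(&wq->mutex);
			show_workqueue_state();
			return;		/* the wq is never torn down */
		}
	}
	mutex_unlock(&wq->mutex);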

That reference count is incremented by get_pwq(), which is called when work is added to the queue. It is also incremented by the "mayday" handling: once in send_mayday(), called from the pool_mayday_timeout() timer handler, and again from within rescuer_thread() in some situations. rescuer_thread() later calls put_pwq() to decrement the reference count.
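
get_pwq() and put_pwq() themselves are simple (again paraphrased from my reading, not exact source):

	static void get_pwq(struct pool_workqueue *pwq)
	{
		lockdep_assert_held(&pwq->pool->lock);
		WARN_ON_ONCE(pwq->refcnt <= 0);
		pwq->refcnt++;
	}

	static void put_pwq(struct pool_workqueue *pwq)
	{
		lockdep_assert_held(&pwq->pool->lock);
		if (likely(--pwq->refcnt))
			return;
		/* last reference: release of an unbound pwq is deferred to a work item */
		schedule_work(&pwq->unbound_release_work);
	}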

By instrumenting workqueue.c, I determined that no work is being added to the queue at the time of the error and that the queue is empty when the error occurs. Before that, however, send_mayday() is occasionally called on our workqueue, followed by the second get_pwq() call from rescuer_thread() (both shown below).
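
For anyone who wants to see the same traces, something along these lines dropped into kernel/workqueue.c and called from get_pwq(), put_pwq() and send_mayday() is enough; this is illustrative only, not my exact instrumentation, and the workqueue name is hypothetical:

	/* illustrative tracing helper, not the exact patch I used */
	static void pwq_ref_trace(struct pool_workqueue *pwq, const char *who)
	{
		/* filter on the workqueue of interest to keep the trace readable */
		if (strcmp(pwq->wq->name, "hfi1_wq"))	/* hypothetical name */
			return;

		trace_printk("%s: pwq=%p refcnt=%d nr_active=%d\n",
			     who, pwq, pwq->refcnt, pwq->nr_active);
	}

Each call site passes __func__ as "who", e.g. pwq_ref_trace(pwq, __func__).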

This happens during normal operation, albeit while the system is heavily loaded. Later, when we try to unload the driver, destroy_workqueue() fails the refcnt check even though the queue is empty.

Here are the relevant checks around the rescuer_thread() call to get_pwq() that increments refcnt:

		if (!list_empty(scheduled)) {
			process_scheduled_works(rescuer);

			/*
			 * The above execution of rescued work items could
			 * have created more to rescue through
			 * pwq_activate_first_delayed() or chained
			 * queueing.  Let's put @pwq back on mayday list so
			 * that such back-to-back work items, which may be
			 * being used to relieve memory pressure, don't
			 * incur MAYDAY_INTERVAL delay inbetween.
			 */
			if (need_to_create_worker(pool)) {
				spin_lock(&wq_mayday_lock);
				get_pwq(pwq);

And here is the relevant code from send_mayday():

	/* mayday mayday mayday */
	if (list_empty(&pwq->mayday_node)) {
		/*
		 * If @pwq is for an unbound wq, its base ref may be put at
		 * any time due to an attribute change.  Pin @pwq until the
		 * rescuer is done with it.
		 */
		get_pwq(pwq);

To me, the comment in the latter seems to imply that the reference count is being incremented to keep the pwq alive. However, looking through the code, that does not appear to be how it works: destroy_workqueue() drains the queue first, but if refcnt is still greater than 1 once the queue is empty, the sanity check fails anyway. Maybe there is something else going on behind the scenes that I don't understand here.
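
To spell out the bookkeeping as I currently read it (this is only my understanding of the code and may well be wrong):

	/*
	 * Base reference on the pwq is 1.  As I read the code paths involved:
	 *
	 *   insert_work()                            get_pwq()  refcnt++
	 *   pool_mayday_timeout() -> send_mayday()   get_pwq()  refcnt++
	 *   rescuer_thread()
	 *     process_scheduled_works()              put_pwq()  refcnt--
	 *         (work finished, via pwq_dec_nr_in_flight())
	 *     need_to_create_worker() path           get_pwq()  refcnt++
	 *     ...
	 *     put_pwq()  "reference grabbed by send_mayday()"   refcnt--
	 *
	 * If every get_pwq() is eventually balanced, the count should be back
	 * at 1 once the queue is empty, yet what I observe at
	 * destroy_workqueue() time is an empty queue with refcnt still > 1.
	 */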

I am trying to understand what is going wrong and whether there is something we can do about it. I'm hoping Tejun can provide some guidance on investigating this issue; otherwise, we may need to move the discussion over to lkml.

--
Gerald Williams



