On Wed, May 30 2007, James Bottomley wrote:
> On Wed, 2007-05-30 at 20:55 +0200, Jens Axboe wrote:
> > On Wed, May 30 2007, James Bottomley wrote:
> > > On Wed, 2007-05-30 at 20:01 +0200, Jens Axboe wrote:
> > > > On Mon, May 28 2007, Andrew Patterson wrote:
> > > > > I am running into a deadlock during domain validation when the
> > > > > request queue is full. I am using the MPT Fusion spi driver and
> > > > > have run into this problem with 2.6.16 and the latest scsi_misc
> > > > > kernels. The system is running a load test on a u320 pSCSI bus
> > > > > with a drive that will occasionally hang the bus until a host
> > > > > reset clears the condition. This particular drive is known to not
> > > > > handle QAS very well. After the host reset, the MPT Fusion driver
> > > > > attempts domain validation on all drives on the bus. During DV,
> > > > > one or more of the queues lock up while trying to execute various
> > > > > SCSI commands (INQUIRY, WRITE_BUFFER, etc.) using the
> > > > > scsi_execute() call. A stack trace shows:
> > > >
> > > > Ugh, that's nasty. If that is a valid scenario (and it looks like it
> > > > is), then we have to reserve a request (and SCSI command) for such
> > > > uses, as the below scenario is definitely livelock country.
> > > >
> > > > > [ 2318.524898] events/1 D a0000001007258f0 0 16 2 (L-TLB)
> > > > > [ 2318.532030]
> > > > > [ 2318.532031] Call Trace:
> > > > > [ 2318.532202] [<a000000100724750>] schedule+0x1550/0x1840
> > > > > [ 2318.532204] sp=e00000010a8dfc60 bsp=e00000010a8d8ff0
> > > > > [ 2318.546975] [<a0000001007258f0>] io_schedule+0x50/0x80
> > > > > [ 2318.546977] sp=e00000010a8dfcf0 bsp=e00000010a8d8fd0
> > > > > [ 2318.554417] [<a0000001003b8820>] get_request_wait+0x200/0x2c0
> > > > > [ 2318.554419] sp=e00000010a8dfcf0 bsp=e00000010a8d8f78
> > > > > [ 2318.562540] [<a0000001003b8990>] blk_get_request+0xb0/0x120
> > > > > [ 2318.562542] sp=e00000010a8dfd40 bsp=e00000010a8d8f40
> > > > > [ 2318.579166] [<a00000010058b5e0>] scsi_execute+0x40/0x1e0
> > > > > [ 2318.579168] sp=e00000010a8dfd40 bsp=e00000010a8d8ee8
> > > > > [ 2318.586863] [<a0000001005980f0>] spi_execute+0x70/0x120
> > > > > [ 2318.586865] sp=e00000010a8dfd40 bsp=e00000010a8d8e88
> > > > > [ 2318.594204] [<a000000100599650>] spi_dv_device_echo_buffer+0x2f0/0x520
> > > > > [ 2318.594206] sp=e00000010a8dfdc0 bsp=e00000010a8d8e30
> > > > > [ 2318.607333] [<a000000100597a30>] spi_dv_retrain+0x70/0x520
> > > > > [ 2318.607335] sp=e00000010a8dfde0 bsp=e00000010a8d8dc0
> > > > > [ 2318.616119] [<a000000100599170>] spi_dv_device+0xdf0/0xf00
> > > > > [ 2318.616121] sp=e00000010a8dfde0 bsp=e00000010a8d8d40
> > > > > [ 2318.630538] [<a00000020db7e360>] mptspi_dv_device+0x160/0x2c0 [mptspi]
> > > > > [ 2318.630540] sp=e00000010a8dfdf0 bsp=e00000010a8d8ce0
> > > > > [ 2318.638341] [<a00000020db7e660>] mptspi_dv_renegotiate_work+0x1a0/0x220 [mptspi]
> > > > > [ 2318.638343] sp=e00000010a8dfdf0 bsp=e00000010a8d8cb0
> > > > > [ 2318.652773] [<a0000001000b80c0>] run_workqueue+0x1c0/0x320
> > > > > [ 2318.652775] sp=e00000010a8dfe00 bsp=e00000010a8d8c80
> > > > > [ 2318.660003] [<a0000001000b8460>] worker_thread+0x240/0x280
> > > > > [ 2318.660005] sp=e00000010a8dfe00 bsp=e00000010a8d8c50
> > > > > [ 2318.667536] [<a0000001000c24e0>] kthread+0xa0/0x120
> > > > > [ 2318.667538] sp=e00000010a8dfe30 bsp=e00000010a8d8c20
> > > > > [ 2318.681699] [<a0000001000129f0>] kernel_thread_helper+0xd0/0x100
> > > > > [ 2318.681701] sp=e00000010a8dfe30 bsp=e00000010a8d8bf0
> > > > > [ 2318.689121] [<a0000001000094c0>] start_kernel_thread+0x20/0x40
> > > > > [ 2318.689124] sp=e00000010a8dfe30 bsp=e00000010a8d8bf0
> > > > >
> > > > > Some code examination and tracing show that get_request_wait()
> > > > > calls get_request() to obtain a request. If get_request() returns
> > > > > NULL, it will wait and try again. Here is the code from
> > > > > get_request_wait():
> > > > >
> > > > > 	rq = get_request(q, rw_flags, bio, GFP_NOIO);
> > > > > 	while (!rq) {
> > > > > 		DEFINE_WAIT(wait);
> > > > > 		struct request_list *rl = &q->rq;
> > > > >
> > > > > 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
> > > > > 				TASK_UNINTERRUPTIBLE);
> > > > >
> > > > > 		rq = get_request(q, rw_flags, bio, GFP_NOIO);
> > > > >
> > > > > 		if (!rq) {
> > > > > 			struct io_context *ioc;
> > > > > 			blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
> > > > >
> > > > > 			__generic_unplug_device(q);
> > > > > 			spin_unlock_irq(q->queue_lock);
> > > > > 			io_schedule();
> > > > >
> > > > > 			/*
> > > > > 			 * After sleeping, we become a "batching" process and
> > > > > 			 * will be able to allocate at least one request, and
> > > > > 			 * up to a big batch of them for a small period time.
> > > > > 			 * See ioc_batching, ioc_set_batching
> > > > > 			 */
> > > > > 			ioc = current_io_context(GFP_NOIO, q->node);
> > > > > 			ioc_set_batching(q, ioc);
> > > > >
> > > > > 			spin_lock_irq(q->queue_lock);
> > > > > 		}
> > > > > 		finish_wait(&rl->wait[rw], &wait);
> > > > > 	}
> > > > >
> > > > > Note the io_schedule() here. As far as I can tell, there is no
> > > > > wakeup for this wait queue. The only wakeups occur when a request
> > > > > is freed. No requests can be processed because the error handling
> > > > > is holding off request processing until the error condition is
> > > > > cleared, so we get a deadlock.
> > > > >
> > > > > Looking through get_request() we see:
> > > > >
> > > > > 	if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) {
> > > > > 		if (rl->count[rw]+1 >= q->nr_requests) {
> > > > > 			ioc = current_io_context(GFP_ATOMIC, q->node);
> > > > > 			/*
> > > > > 			 * The queue will fill after this allocation, so set
> > > > > 			 * it as full, and mark this process as "batching".
> > > > > 			 * This process will be allowed to complete a batch of
> > > > > 			 * requests, others will be blocked.
> > > > > 			 */
> > > > > 			if (!blk_queue_full(q, rw)) {
> > > > > 				ioc_set_batching(q, ioc);
> > > > > 				blk_set_queue_full(q, rw);
> > > > > 			} else {
> > > > > 				if (may_queue != ELV_MQUEUE_MUST
> > > > > 						&& !ioc_batching(q, ioc)) {
> > > > > 					/*
> > > > > 					 * The queue is full and the allocating
> > > > > 					 * process is not a "batcher", and not
> > > > > 					 * exempted by the IO scheduler
> > > > > 					 */
> > > > > 					goto out;
> > > > > 				}
> > > > > 			}
> > > > > 		}
> > > > > 		blk_set_queue_congested(q, rw);
> > > > > 	}
> > > > >
> > > > > In this heavily loaded system, we get into the "goto out" because
> > > > > count > nr_requests. The "goto out" will lead to returning NULL.
> > > > > This condition would not occur if ioc_batching was set, but this
> > > > > is not done until after the io_schedule() in get_request_wait().
> > > >
> > > > It doesn't matter, memory allocation could still block due to
> > > > reclaim, which won't happen because no more IO is getting through.
> > > > Or if you went atomic it could also fail.
> > > >
> > > > There's no other solution than maintaining a cached request + command
> > > > for this. libata has a similar issue wrt error handling with ncq, we
> > > > may need a command in error handling to retrieve the log page.
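
For illustration, a minimal sketch of such a reserved request, assuming the
block layer grew a per-queue "parked" request (struct reserved_rq,
eh_get_request() and the pre-allocation at queue setup are all made up here;
only blk_get_request() is real). The point is simply that the EH/DV path
never sleeps on the exhausted q->rq pool:

	/*
	 * Hypothetical sketch only -- no such interface exists in the block
	 * layer. The idea is to park one request per queue at init time so
	 * that error handling / domain validation can always make progress,
	 * even when q->rq is exhausted and get_request_wait() would sleep
	 * forever.
	 */
	struct reserved_rq {
		spinlock_t	lock;
		struct request	*rq;	/* set aside when the queue is created */
		int		in_use;
	};

	static struct request *eh_get_request(struct request_queue *q,
					      struct reserved_rq *res, int rw)
	{
		struct request *rq;

		/* Try the normal pool first, but never sleep on it. */
		rq = blk_get_request(q, rw, GFP_ATOMIC);
		if (rq)
			return rq;

		/* Pool is exhausted: fall back to the parked request. */
		spin_lock_irq(&res->lock);
		if (!res->in_use) {
			res->in_use = 1;
			rq = res->rq;
		}
		spin_unlock_irq(&res->lock);

		/* Still NULL only if EH/DV itself already holds the reserve. */
		return rq;
	}

A matching release helper would either blk_put_request() a normally allocated
request or simply clear in_use on the reserved one.
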
> > >
> > > Actually, there is another solution: DV is careful to use only a
> > > single command for its process ... if we could use the eh command for
> > > this, then I think the problem would go away ... unfortunately, that's
> > > a bit more complex to achieve than it sounds.

(btw this is not another solution, it's indeed the solution of keeping a
reserved request :-)

> > That would be fine, the key is just to have such a reserved command. Is
> > there also a reserved request?
>
> Yes ... we clean out the failing command in error recovery and reuse it,
> so we know it has both a command and a request.

Sounds a bit hackish, unless the failed command never needs to be retried.

-- 
Jens Axboe
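
As a rough illustration of the approach discussed above (reusing the request
that error recovery already owns instead of allocating a new one), a variant
of scsi_execute() might look like the sketch below. scsi_execute_reserved()
is a hypothetical name, the request setup simply mirrors what scsi_execute()
does today, and details such as re-initialising the reused request and fixing
up its data direction are glossed over:

	/*
	 * Hypothetical sketch: scsi_execute() minus the blk_get_request()
	 * that can block forever when the queue is exhausted.  The caller
	 * passes in a request it already owns (e.g. the one recovered from
	 * the failed command in error recovery) and keeps ownership of it.
	 */
	static int scsi_execute_reserved(struct scsi_device *sdev,
					 struct request *req,
					 const unsigned char *cmd,
					 void *buffer, unsigned bufflen,
					 unsigned char *sense,
					 int timeout, int retries)
	{
		/* Map the kernel buffer, exactly as scsi_execute() does. */
		if (bufflen && blk_rq_map_kern(sdev->request_queue, req,
					       buffer, bufflen, GFP_NOIO))
			return DRIVER_ERROR << 24;

		req->cmd_len = COMMAND_SIZE(cmd[0]);
		memcpy(req->cmd, cmd, req->cmd_len);
		req->sense = sense;
		req->sense_len = 0;
		req->retries = retries;
		req->timeout = timeout;
		req->cmd_type = REQ_TYPE_BLOCK_PC;
		req->cmd_flags |= REQ_QUIET | REQ_PREEMPT;

		/* Head insertion, so the command still runs on a quiesced device. */
		blk_execute_rq(req->q, NULL, req, 1);

		/* No blk_put_request() here -- the caller keeps the request. */
		return req->errors;
	}

Note that blk_rq_map_kern() can still allocate memory, so the reclaim concern
raised above applies to the buffer mapping as well; a real implementation
would probably want the DV buffer mapped up front too.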