On Wed, 2007-05-30 at 21:03 +0200, Jens Axboe wrote: > On Wed, May 30 2007, James Bottomley wrote: > > On Wed, 2007-05-30 at 20:55 +0200, Jens Axboe wrote: > > > On Wed, May 30 2007, James Bottomley wrote: > > > > On Wed, 2007-05-30 at 20:01 +0200, Jens Axboe wrote: > > > > > On Mon, May 28 2007, Andrew Patterson wrote: > > > > > > I am running into deadlock during domain validation when the request > > > > > > queue is full. I am using the MPT Fusion spi driver and have run into > > > > > > this problem with 2.6.16 and the latest scsi_misc kernels. The system > > > > > > is running a load test on a u320 pSCSI bus with a drive that will > > > > > > occasionally hang the bus until a host reset clears the condition. This > > > > > > particular drive in known to not handle QAS very well. After the host > > > > > > reset, the MPT Fusion driver attempts domain validation on all drives on > > > > > > the bus. During DV, one or more of the queues lockup while trying to > > > > > > execute various SCSI commands (INQUIRY, WRITE_BUFFER, etc) using the > > > > > > scsi_execute() call. A stack trace shows: > > > > > > > > > > Ugh, that's nasty. If that is a valid scenario (and it looks like it > > > > > is), then we have reserve a request (and SCSI command) for such uses as > > > > > the below scenario is definitely livelock country. > > > > > > > > > > > [ 2318.524898] events/1 D a0000001007258f0 0 16 2 (L-TLB) > > > > > > [ 2318.532030] > > > > > > [ 2318.532031] Call Trace: > > > > > > [ 2318.532202] [<a000000100724750>] schedule+0x1550/0x1840 > > > > > > [ 2318.532204] sp=e00000010a8dfc60 bsp=e00000010a8d8ff0 > > > > > > [ 2318.546975] [<a0000001007258f0>] io_schedule+0x50/0x80 > > > > > > [ 2318.546977] sp=e00000010a8dfcf0 bsp=e00000010a8d8fd0 > > > > > > [ 2318.554417] [<a0000001003b8820>] get_request_wait+0x200/0x2c0 > > > > > > [ 2318.554419] sp=e00000010a8dfcf0 bsp=e00000010a8d8f78 > > > > > > [ 2318.562540] [<a0000001003b8990>] blk_get_request+0xb0/0x120 > > > > > > [ 2318.562542] sp=e00000010a8dfd40 bsp=e00000010a8d8f40 > > > > > > [ 2318.579166] [<a00000010058b5e0>] scsi_execute+0x40/0x1e0 > > > > > > [ 2318.579168] sp=e00000010a8dfd40 bsp=e00000010a8d8ee8 > > > > > > [ 2318.586863] [<a0000001005980f0>] spi_execute+0x70/0x120 > > > > > > [ 2318.586865] sp=e00000010a8dfd40 bsp=e00000010a8d8e88 > > > > > > [ 2318.594204] [<a000000100599650>] spi_dv_device_echo_buffer+0x2f0/0x520 > > > > > > [ 2318.594206] sp=e00000010a8dfdc0 bsp=e00000010a8d8e30 > > > > > > [ 2318.607333] [<a000000100597a30>] spi_dv_retrain+0x70/0x520 > > > > > > [ 2318.607335] sp=e00000010a8dfde0 bsp=e00000010a8d8dc0 > > > > > > [ 2318.616119] [<a000000100599170>] spi_dv_device+0xdf0/0xf00 > > > > > > [ 2318.616121] sp=e00000010a8dfde0 bsp=e00000010a8d8d40 > > > > > > [ 2318.630538] [<a00000020db7e360>] mptspi_dv_device+0x160/0x2c0 [mptspi] > > > > > > [ 2318.630540] sp=e00000010a8dfdf0 bsp=e00000010a8d8ce0 > > > > > > [ 2318.638341] [<a00000020db7e660>] mptspi_dv_renegotiate_work+0x1a0/0x220 [mptspi] > > > > > > [ 2318.638343] sp=e00000010a8dfdf0 bsp=e00000010a8d8cb0 > > > > > > [ 2318.652773] [<a0000001000b80c0>] run_workqueue+0x1c0/0x320 > > > > > > [ 2318.652775] sp=e00000010a8dfe00 bsp=e00000010a8d8c80 > > > > > > [ 2318.660003] [<a0000001000b8460>] worker_thread+0x240/0x280 > > > > > > [ 2318.660005] sp=e00000010a8dfe00 bsp=e00000010a8d8c50 > > > > > > [ 2318.667536] [<a0000001000c24e0>] kthread+0xa0/0x120 > > > > > > [ 2318.667538] sp=e00000010a8dfe30 bsp=e00000010a8d8c20 > > > > > > [ 2318.681699] [<a0000001000129f0>] kernel_thread_helper+0xd0/0x100 > > > > > > [ 2318.681701] sp=e00000010a8dfe30 bsp=e00000010a8d8bf0 > > > > > > [ 2318.689121] [<a0000001000094c0>] start_kernel_thread+0x20/0x40 > > > > > > [ 2318.689124] sp=e00000010a8dfe30 bsp=e00000010a8d8bf0 > > > > > > > > > > > > > > > > > > Some code examination and tracing show that get_request_wait() calls > > > > > > get_request() to obtain a request. If get_request() returns NULL, it > > > > > > will wait and try again. Here is the code from get_request_wait(): > > > > > > > > > > > > rq = get_request(q, rw_flags, bio, GFP_NOIO); > > > > > > while (!rq) { > > > > > > DEFINE_WAIT(wait); > > > > > > struct request_list *rl = &q->rq; > > > > > > > > > > > > prepare_to_wait_exclusive(&rl->wait[rw], &wait, > > > > > > TASK_UNINTERRUPTIBLE); > > > > > > > > > > > > rq = get_request(q, rw_flags, bio, GFP_NOIO); > > > > > > > > > > > > if (!rq) { > > > > > > struct io_context *ioc; > > > > > > blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ); > > > > > > > > > > > > __generic_unplug_device(q); > > > > > > spin_unlock_irq(q->queue_lock); > > > > > > io_schedule(); > > > > > > > > > > > > /* > > > > > > * After sleeping, we become a "batching" process and > > > > > > * will be able to allocate at least one request, and > > > > > > * up to a big batch of them for a small period time. > > > > > > * See ioc_batching, ioc_set_batching > > > > > > */ > > > > > > ioc = current_io_context(GFP_NOIO, q->node); > > > > > > ioc_set_batching(q, ioc); > > > > > > > > > > > > spin_lock_irq(q->queue_lock); > > > > > > } > > > > > > finish_wait(&rl->wait[rw], &wait); > > > > > > } > > > > > > > > > > > > Note the io_schedule() here. As far as I can tell, there is not wakeup > > > > > > for this wait queue. The only wakeup's occur when a request is freed. > > > > > > No requests can be processed because the error handling is holding off > > > > > > request processing until the error condition is cleared so we get a > > > > > > deadlock. > > > > > > > > > > > > Looking through get_request() we see: > > > > > > > > > > > > if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) { > > > > > > if (rl->count[rw]+1 >= q->nr_requests) { > > > > > > ioc = current_io_context(GFP_ATOMIC, q->node); > > > > > > /* > > > > > > * The queue will fill after this allocation, so set > > > > > > * it as full, and mark this process as "batching". > > > > > > * This process will be allowed to complete a batch of > > > > > > * requests, others will be blocked. > > > > > > */ > > > > > > if (!blk_queue_full(q, rw)) { > > > > > > ioc_set_batching(q, ioc); > > > > > > blk_set_queue_full(q, rw); > > > > > > } else { > > > > > > if (may_queue != ELV_MQUEUE_MUST > > > > > > && !ioc_batching(q, ioc)) { > > > > > > /* > > > > > > * The queue is full and the allocating > > > > > > * process is not a "batcher", and not > > > > > > * exempted by the IO scheduler > > > > > > goto out; > > > > > > } > > > > > > } > > > > > > } > > > > > > blk_set_queue_congested(q, rw); > > > > > > } > > > > > > > > > > > > In this heavily loaded system, we get into the "goto out" because count > > > > > > > nr_requests. The "goto out" will lead to returning NULL. This > > > > > > condition would not occur if ioc_batching was set, but this is not done > > > > > > until after the io_schedule() in get_request_wait(). > > > > > > > > > > It doesn't matter, memory allocation could still block due to reclaim > > > > > which wont happen because no more IO is getting through. Or if you went > > > > > atomic it could also fail. > > > > > > > > > > There's no other solution than maintaining a cached request + command > > > > > for this. libata has a similar issue wrt error handling with ncq, we may > > > > > need a command in error handling to retrieve the log page. > > > > > > > > Actually, there is another solution: DV is careful only to be using a > > > > single command for its processes ... if we could use the eh command for > > > > this, then I think the problem would go away ... unfortunately, that's a > > > > bit more complex to achieve than it sounds. > > (btw this is not another solution, it's indeed the solution of keeping a > reserved request :-) > > > > That would be fine, the key is just to have such a reserved command. Is > > > there also a reserved request? > > > > Yes ... we clean out the failing command in error recovery and reuse it, > > so we know it has both a command and a request. > > Sounds a bit hackish, unless the failed command never needs to be > retried. Oh, it does. We clean it out, save the necessary on the stack and reuse it, then restore the data and send it for a retry. Up until a while ago it was what all the old_ fields were for in the SCSI command; now, after Christoph fixed it, we save them on the stack instead. James - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html