Re: [PATCH] scsi_transport_fc: handle transient error on multipath environment

Hi Mike,

Sorry for the late reply, and thanks for the advice.

> On 02/12/2010 11:46 AM, Mike Christie wrote:

> What transport problems are you seeing where the rport is not blocked
> and the scsi cmd timer fires? Would it be mostly buggy switches or
> something like that?

 When a disk is broken (a broken FC switch is also possible), a SCSI command can
 time out without the fibre channel remote port being blocked, which simply means
 there was no response (no hardware interrupt on I/O completion) from the disk.
 In the current SCSI driver, this ends with fc_timed_out returning
 BLK_EH_NOT_HANDLED, and as a result it takes a long time to get through the unjam
 function of the error handler. The older SCSI driver has no transport timeout,
 but it basically behaves the same way.
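
 As a rough illustration of the case described above, the sketch below models
 the timeout decision in plain userspace C. The enum values and the struct are
 simplified stand-ins, not the kernel definitions, and model_fc_timed_out is a
 hypothetical name. It only mirrors the rule that a blocked rport gets its timer
 reset, while anything else falls through to the SCSI error handler.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel types; values are illustrative only. */
enum model_eh_return { MODEL_EH_NOT_HANDLED, MODEL_EH_RESET_TIMER };
enum model_port_state { MODEL_PORT_ONLINE, MODEL_PORT_BLOCKED };

/* Hypothetical, minimal view of a remote port. */
struct model_rport {
	enum model_port_state port_state;
};

/*
 * Model of the transport timeout hook: only a blocked rport keeps the
 * command timer running. An online rport whose disk never answers is
 * reported as NOT_HANDLED, which hands the command to the error handler.
 */
enum model_eh_return model_fc_timed_out(const struct model_rport *rport)
{
	if (rport->port_state == MODEL_PORT_BLOCKED)
		return MODEL_EH_RESET_TIMER;
	return MODEL_EH_NOT_HANDLED;
}
```

 In this model, a silent disk behind an online rport always takes the
 NOT_HANDLED branch, which is the slow path discussed here.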

 The previous patch was meant to avoid the unjam and let the multipath software
 quickly discard the broken path, but as you and James mentioned, the dm layer,
 not the transport, should handle I/O latency as a path quality issue.

> > - Maybe you want to instead hook something into the dm-multipath's
> > request (no more bios like in 2004 :)). Can you set a timer on that
> > level of request. If that times out then, dm-multipath could do
> > something like call blk_abort_queue.
>
> Some more detail. I was thinking maybe you could stack the timeout
> handlers like is done for request_fn handlers or maybe the scsi cmd
> would use the upper layer's timer somehow. Not sure... but at the least
> I think we would not want both a scsi request and dm request timers
> running at the same time.

 I've been reading the device mapper and dmsetup code, trying to understand
 what you were saying. It seems there are two ways to do it in the device mapper
 layer. One is to simply modify the dm-mpath target driver so that its
 request_queue overrides the timeout handler and uses a different one instead of
 the normal SCSI timeout handler. The other is to write something like a
 dm-timeout target driver which stacks timeout functionality on top of another
 device mapper device. The first looks much easier, but stacking the
 functionality on top of dm-mpath would be more flexible (if it's possible).
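
 To make the first option more concrete, here is a minimal userspace sketch of a
 queue whose timeout handler can be overridden. Every name in it (model_queue,
 model_request, model_mpath_timed_out and so on) is hypothetical and chosen for
 illustration; it is not the block layer or dm-mpath API, and it says nothing
 about how a stacked dm-timeout target would look.

```c
#include <assert.h>
#include <stddef.h>

enum q_eh_return { Q_EH_NOT_HANDLED, Q_EH_HANDLED };

struct model_request;
typedef enum q_eh_return (*model_timeout_fn)(struct model_request *rq);

/* Hypothetical queue: holds the handler run when a request times out. */
struct model_queue {
	model_timeout_fn timed_out;
};

struct model_request {
	struct model_queue *q;
	int path_failed;	/* set when the upper layer fails the path */
};

/* Lower-layer (SCSI-like) default: leave the request to the error handler. */
enum q_eh_return model_scsi_timed_out(struct model_request *rq)
{
	(void)rq;
	return Q_EH_NOT_HANDLED;
}

/* Upper-layer (dm-mpath-like) handler: mark the path failed right away. */
enum q_eh_return model_mpath_timed_out(struct model_request *rq)
{
	rq->path_failed = 1;
	return Q_EH_HANDLED;
}

/* Dispatch a timeout through whichever handler the queue currently has. */
enum q_eh_return model_request_timeout(struct model_request *rq)
{
	assert(rq->q != NULL && rq->q->timed_out != NULL);
	return rq->q->timed_out(rq);
}
```

 Swapping the function pointer on the queue is all the override amounts to in
 this model; only one handler runs per timeout, so two timers never compete.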

> Then for the error handling and timeout handling, most FC drivers have a
> terminate_rport_io which works without having to block the entire host.
> Those drivers could implement a newer eh where instead of firing the
> code in scsi_error.c when a cmd times out, it would run
> terminate_rport_io from some workqueue.
>
> new dm request timed out()
> 	-> scsi_timed_out
> 		-> fc_timed_out()
> 			{
> 				run new eh from workqueue();
> 			}
> 
> new_eh()
> 	/* no new cmds should be started until we figure out what is going on */
> 	block rport()
> 	/* releases cmds upwards so they can run while we try to figure out
> what is going on */
> 	terminate_rport_io()
> 	/* check if devices are ok */
> 	send_tur()
> 	if (tur failed)
> 		start old scsi_error.c code to unjam us.
> 	else
> 		/* everything looks ok so let IO run to this path again */
> 		unblock rport()

 Agreed. Even if I take care of the I/O latency issue with a dm timeout, I still
 have to add a different timeout handler in the SCSI/FC layer, and that might end
 up removing the remote port as I did in the previous patch (and I agree that
 removing a fibre channel remote port just because of latency on a single I/O is
 somewhat odd).
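
 The new_eh() flow quoted above can be modelled in userspace C roughly as
 follows. This is only a sketch under stated assumptions: struct eh_rport, the
 model_* functions and the result enum are hypothetical stand-ins, the TUR probe
 is reduced to reading a flag, and the workqueue context is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one remote port; not a kernel structure. */
struct eh_rport {
	bool blocked;		/* true while no new commands may be started */
	int outstanding;	/* commands currently in flight on this path */
	bool device_ok;		/* what a TEST UNIT READY probe would report */
};

enum model_eh_result { MODEL_EH_PATH_RESTORED, MODEL_EH_ESCALATED };

/* Release all in-flight commands upwards; returns how many were released. */
int model_terminate_rport_io(struct eh_rport *rport)
{
	int released = rport->outstanding;

	rport->outstanding = 0;
	return released;
}

/* Stand-in for sending TEST UNIT READY; true if the device answers. */
bool model_send_tur(const struct eh_rport *rport)
{
	return rport->device_ok;
}

/* Block, terminate I/O, probe, then either unblock or escalate. */
enum model_eh_result model_new_eh(struct eh_rport *rport, int *released)
{
	rport->blocked = true;
	*released = model_terminate_rport_io(rport);

	if (!model_send_tur(rport))
		return MODEL_EH_ESCALATED;	/* rport stays blocked */

	rport->blocked = false;
	return MODEL_EH_PATH_RESTORED;
}
```

 In the model, the commands are released before the probe, so the upper layer
 can retry them on another path regardless of how the probe turns out.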

> > 
> > I think the problem with blk_abort_queue might be that it stops all IO
> > to the entire host where you probably just want to work on the remote
> > port/path. For that you could call something like
> > recover_transient_error. Maybe it could just be a call to
> > terminate_rport_io from a workqueue though.

 I think I'll post a patch to the dm-devel and linux-scsi mailing lists before long.

Thanks,
Tomohiro Kusumi



(2010/02/13 3:03), Mike Christie wrote:
> On 02/12/2010 11:46 AM, Mike Christie wrote:
>> - Maybe you want to instead hook something into the dm-multipath's
>> request (no more bios like in 2004 :)). Can you set a timer on that
>> level of request. If that times out then, dm-multipath could do
>> something like call blk_abort_queue.
> 
> Some more detail. I was thinking maybe you could stack the timeout
> handlers like is done for request_fn handlers or maybe the scsi cmd
> would use the upper layer's timer somehow. Not sure... but at the least
> I think we would not want both a scsi request and dm request timers
> running at the same time.
> 
> Then for the error handling and timeout handling, most FC drivers have a
> terminate_rport_io which works without having to block the entire host.
> Those drivers could implement a newer eh where instead of firing the
> code in scsi_error.c when a cmd times out, it would run
> terminate_rport_io from some workqueue.
> 
> new dm request timed out()
> 	->  scsi_timed_out
> 		->  fc_timed_out()
> 			{
> 				run new eh from workqueue();
> 			}
> 
> 
> new_eh()
> 	/* no new cmds should be started until we figure out what is going on */
> 	block rport()
> 	/* releases cmds upwards so they can run while we try to figure out
> what is going on */
> 	terminate_rport_io()
> 	/* check if devices are ok */
> 	send_tur()
> 	if (tur failed)
> 		start old scsi_error.c code to unjam us.
> 	else
> 		/* everything looks ok so let IO run to this path again */
> 		unblock rport()
> 
> 
>>
>> I think the problem with blk_abort_queue might be that it stops all IO
>> to the entire host where you probably just want to work on the remote
>> port/path. For that you could call something like
>> recover_transient_error. Maybe it could just be a call to
>> terminate_rport_io from a workqueue though.
> 
> 

