Re: [PATCH] dm mpath: Try recover from I/O failure by re-initializing the PG if device is running on one path

Kiyoshi Ueda <k-ueda@xxxxxxxxxxxxx> · Tue, 21 Apr 2009 10:06:44 +0900

Hi Babu,

On 2009/04/21 3:05 +0900, Moger, Babu wrote:
> This patch introduces the mechanism to recover from I/O failures by
> re-initializing the path if the device is running on only one path. 
> 
> Problem: Device mapper fails the path for every I/O error. It does not
> care about the type of error. There are certain errors which can be
> recovered by re-initializing the path again. I have seen this problem
> during my testing on the rdac device handler. I have observed I/O errors
> when there is a change in LUN ownership. When LUN ownership changes, the
> device returns a check condition with
> sense 0x05/0x94/0x01 (SK/ASC/ASCQ, meaning LUN ownership changed).
> Currently, device mapper fails the path for this error and eventually
> this will lead to I/O error. We don't want to see I/O error for this reason. 

Shouldn't we handle this type of device error inside the device handler?
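
To illustrate what I mean, here is a minimal userspace sketch of a
handler-side check that recognizes only the sense triple quoted above
and reports everything else as not handled. All names in it
(sense_triple, dh_action, dh_check_sense) are hypothetical stand-ins
for illustration; this is not the scsi_dh interface or the rdac
handler's actual code, and the "re-initialize the path" action is only
what the patch description proposes for this error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for decoded sense data (not a kernel type). */
struct sense_triple {
	uint8_t sk;	/* sense key */
	uint8_t asc;	/* additional sense code */
	uint8_t ascq;	/* additional sense code qualifier */
};

/* Hypothetical verdicts a device handler could give for an error. */
enum dh_action {
	DH_REINIT_PATH,	/* recoverable: the path needs re-initialization */
	DH_NOT_HANDLED	/* unknown to the handler: generic error handling */
};

/*
 * Only 0x05/0x94/0x01 (LUN ownership changed, as described in the
 * patch) is treated as recoverable. Every other error, including a
 * real medium error, is left to the generic path.
 */
enum dh_action dh_check_sense(const struct sense_triple *s)
{
	assert(s != NULL);

	if (s->sk == 0x05 && s->asc == 0x94 && s->ascq == 0x01)
		return DH_REINIT_PATH;

	return DH_NOT_HANDLED;
}
```

With the decision keyed on the error type like this, dm would not
have to guess from the number of remaining paths whether an error is
recoverable.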

> The patch will set the flag pg_init_required if the device is running
> on single path. The process_queued_ios will re-initialize path if required.
> I have tested this patch on LSI rdac handler.
> 
> Signed-off-by: Babu Moger <babu.moger@xxxxxxx>
> ---
> 
> --- linux-2.6.30-rc2/drivers/md/dm-mpath.c.orig	2009-04-17 16:49:33.000000000 -0500
> +++ linux-2.6.30-rc2/drivers/md/dm-mpath.c	2009-04-17 17:09:51.000000000 -0500
> @@ -1152,6 +1152,15 @@ static int do_end_io(struct multipath *m
>  		return error;
>  
>  	spin_lock_irqsave(&m->lock, flags);
> +	/*
> +	 * If this is the only path left, then lets try to
> +	 * re-initialize the PG one last time..
> +	 */
> +	if (m->nr_valid_paths == 1 && m->hw_handler_name) {
> +		m->pg_init_required = 1;
> +		spin_unlock_irqrestore(&m->lock, flags);
> +		goto requeue;
> +	}
>  	if (!m->nr_valid_paths) {
>  		if (__must_push_back(m)) {
>  			spin_unlock_irqrestore(&m->lock, flags);

What happens in the case of a real I/O error (e.g. I/O to a broken sector)?
Is it handled correctly and eventually returned to the upper layer?
I'm asking because, with this change, it looks like dm would retry such
errors forever.
Or am I missing something?
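
To make the concern concrete, here is a toy userspace model of the
error branch as I read the hunk. It is only a sketch: the toy_* names
are hypothetical, locking is omitted, and I am assuming that the code
after the quoted hunk fails the path (so the valid-path count drops)
and that an I/O with no valid paths is simply failed, ignoring
queue_if_no_path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the struct multipath fields used in the hunk. */
struct toy_mpath {
	unsigned int nr_valid_paths;
	bool has_hw_handler;	/* models m->hw_handler_name != NULL */
	bool pg_init_required;
};

enum toy_result { TOY_REQUEUE, TOY_FAIL_IO };

/* Models one pass through the error branch for a failed I/O. */
enum toy_result toy_end_io_error(struct toy_mpath *m)
{
	assert(m != NULL);

	/* The added check: last path plus a hardware handler. */
	if (m->nr_valid_paths == 1 && m->has_hw_handler) {
		m->pg_init_required = true;
		return TOY_REQUEUE;	/* the path is never failed */
	}

	if (m->nr_valid_paths == 0)
		return TOY_FAIL_IO;	/* assumption: no queueing */

	/* Assumption: the existing code fails the path, then requeues. */
	m->nr_valid_paths--;
	return TOY_REQUEUE;
}

/* Counts requeues of an I/O that fails every time, up to a cap. */
unsigned int toy_count_requeues(struct toy_mpath *m, unsigned int cap)
{
	unsigned int n = 0;

	while (n < cap && toy_end_io_error(m) == TOY_REQUEUE)
		n++;

	return n;
}
```

In this model, an I/O that always fails on a single path with a
hardware handler is requeued until the cap is reached and the path
count never drops, while without a hardware handler the same I/O is
requeued once and then failed.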

Thanks,
Kiyoshi Ueda