RE: [PATCH] sd: sd should not modify read capacity, cache type or write protect flag on rescan when there is a transport error

<Menny_Hamburger@xxxxxxxx> · Tue, 8 Mar 2011 09:30:37 +0000

I think the kernel should have some ability to handle this situation (or at least propagate the fact that there is a problem) without userland being involved.

We could add this as a devinfo option, enabling the functionality only for specific devices.
A completely different approach would be to have the SCSI layer send a notification (unknown property, needs rescan) that could be picked up and handled by the transport layer.

Menny

-----Original Message-----
From: linux-scsi-owner@xxxxxxxxxxxxxxx [mailto:linux-scsi-owner@xxxxxxxxxxxxxxx] On Behalf Of James Bottomley
Sent: 28 February, 2011 17:35
To: Hamburger, Menny
Cc: linux-scsi@xxxxxxxxxxxxxxx
Subject: Re: [PATCH] sd: sd should not modify read capacity, cache type or write protect flag on rescan when there is a transport error

On Sun, 2011-02-27 at 14:21 +0000, Menny_Hamburger@xxxxxxxx wrote:
> From: Menny Hamburger <Menny_Hamburger@xxxxxxxx>
> 
> When an sd scan fails to read the capacity, cache_type or write protect flag
> property from the device, it automatically assigns a default value to the
> failed property. On a rescan that fails with a transport/host error, this default
> value is invalid, since the problem is with the connection to the device and not with
> the device itself, which (in most cases) is still intact. Applying a default value
> on failure may lead to problems when the connection is re-established, since the default
> value persists until an additional rescan is performed.

That's correct.  Zero means we know there's something there but we
couldn't get the necessary information.  A zero size device can't be
read from or written to.

> This problem was observed when running in an iSCSI environment under multipath
> (with I/O on the active path). In this case we get a ping-pong effect where
> multipathd switches between alternate paths forever (until a rescan) because the
> path checker reports that the device is OK, while I/O fails immediately because of
> the 0 capacity (assigned to the device when rescanning while the device was
> disconnected from the storage).
> 
> Reproduction in an iSCSI environment:
> 1) dd if=/dev/dm-0 of=/dev/zero bs=64 count=10000
> 2) ifdown ethN, ethM, ethK, ... (where ethX is an interface from which the
>    machine establishes connection to the storage array).
> 3) iscsiadm -m session -R
> 4) ifup ethN, ethM, ethK, ...

This really doesn't look like a good idea.  It's a layering violation in
that the SCSI mid layer now has to try to determine if certain command
failures are the result of host disruption.

The idea of believing a prior value if a READ_CAPACITY fails also
doesn't look to be such a good one.  This could lead to volume
corruption if the disruption is part of an array configuration.

The correct fix looks to be to initiate a rescan when the host is active
via hotplug, and just teach the path checker about zero size devices?
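
To make the second half of that suggestion concrete, here is a minimal userland sketch of a check that treats a zero-size device as a failed path. It assumes the checker can read the block device's sysfs size attribute (/sys/block/<dev>/size, in 512-byte sectors). The function name path_state_from_size is hypothetical and is not part of multipath-tools; a real path checker would be implemented inside multipathd rather than as a shell function.

```shell
#!/bin/sh
# Hypothetical helper, not part of multipath-tools: classify a path by the
# capacity the kernel currently reports for it.
#
# $1 is a file holding the size in 512-byte sectors, e.g. /sys/block/sdX/size.
# Prints "up" for a non-zero size, "down" for zero, "unknown" otherwise.
path_state_from_size() {
    size_file=$1

    # An unreadable attribute means the device is gone or not yet set up.
    if [ ! -r "$size_file" ]; then
        echo unknown
        return 0
    fi

    sectors=$(cat "$size_file")

    case $sectors in
        ''|*[!0-9]*)
            # Empty or non-numeric content: do not guess.
            echo unknown
            ;;
        *)
            if [ "$sectors" -eq 0 ]; then
                # Zero capacity: I/O would fail immediately, so report the
                # path as down instead of OK.
                echo down
            else
                echo up
            fi
            ;;
    esac
}
```

For example, `path_state_from_size /sys/block/sdb/size` (sdb being a placeholder device name) would print "down" for a path whose capacity was reset to 0 by a rescan during the outage, which is the case the existing checker reports as OK.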

James
