Re: 2.6.27 looping on SCSI errors for bad sectors

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 11 Dec 2008 17:02:04 -0500 (EST)

On Thu, 11 Dec 2008, Daniel Drake wrote:

> Hi Alan,
> 
> I'm aware of your work at http://bugzilla.kernel.org/show_bug.cgi?id=11843
> 
> I agree with fixing the unusual_devs file for USB devices that report 
> the wrong capacity, but this SCSI "looping on error" problem reaches 
> further than that. Gentoo has a bug report at 
> https://bugs.gentoo.org/show_bug.cgi?id=248698 where there is a "real" 
> bad sector in the middle of a disk, and this bug is affecting recovery 
> of said disk.
> 
> On the kernel bugzilla you posted some patches that would improve the 
> behaviour of 2.6.27 here. Are those patches candidates for 2.6.27.x, or 
> do you know if it's being fixed another way, or is it a lost cause?
> I understand that 2.6.28 has been fixed through a major rework in that area.

It's a complicated story.

For other readers, here's a summary of the Gentoo bug report.  It has
two parts: One is that 2.6.26 doesn't report a bad block using an
"unknown" controller; the other is that 2.6.27 loops indefinitely when
reading the bad block using the "unknown" controller.  The key aspect
of this controller is that when asked to read 8 sectors (4096 bytes) of
which at least one is bad, it returns 1026 bytes of data with a residue
of 3070, Check Condition status, and no sense (SK = ASC = ASCQ = 0).
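
As a quick check of the numbers above, here is a minimal Python sketch of the transfer arithmetic.  The 512-byte sector size is implied by "8 sectors (4096 bytes)"; the function name and the dictionary layout are hypothetical, chosen only for illustration.

```python
SECTOR_SIZE = 512  # bytes per sector, implied by "8 sectors (4096 bytes)"


def describe_transfer(requested_sectors, bytes_received):
    """Summarize a short read: residue, and how many whole sectors arrived."""
    requested = requested_sectors * SECTOR_SIZE
    residue = requested - bytes_received
    whole_sectors, leftover = divmod(bytes_received, SECTOR_SIZE)
    return {
        "requested": requested,
        "residue": residue,
        "whole_sectors": whole_sectors,
        "leftover_bytes": leftover,
    }


# The case reported for the "unknown" controller: 8 sectors requested,
# 1026 bytes actually returned.
result = describe_transfer(8, 1026)
print(result)
```

The residue comes out at 3070, matching the report, and the 1026 bytes received amount to two whole sectors plus 2 stray bytes.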

The fact that the number of "good" bytes isn't a multiple of the sector
size is suspicious in itself, but let that pass.  The real problem has
to do with the lack of sense data.  When usb-storage sees there's no
sense, it changes the status to SAM_STAT_GOOD and clears the sense
buffer.  But since the number of bytes is less than it should be, the
final result is DID_ERROR with SUGGEST_RETRY.
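
The sequence just described can be sketched as a simplified Python model.  This is not the kernel code: the function name and signature are hypothetical, the constant values are quoted from memory of the SCSI headers of that era, and the model assumes the command requires every requested byte (an "underflow" threshold equal to the request length).

```python
# Constant values as recalled from the kernel's SCSI headers (assumption).
SAM_STAT_GOOD = 0x00
SAM_STAT_CHECK_CONDITION = 0x02
DID_ERROR = 0x07       # host byte
SUGGEST_RETRY = 0x10   # driver byte


def final_result(status, sense, requested, residue, underflow):
    """Model of the result after the transport examines the sense data.

    sense is a (SK, ASC, ASCQ) tuple; underflow is the minimum number of
    bytes the command must transfer to count as successful.
    """
    # Check Condition with empty sense data is treated as "no error".
    if status == SAM_STAT_CHECK_CONDITION and sense == (0, 0, 0):
        status = SAM_STAT_GOOD

    # A "good" command that moved too little data becomes a retryable error.
    transferred = requested - residue
    if status == SAM_STAT_GOOD and transferred < underflow:
        return (DID_ERROR << 16) | (SUGGEST_RETRY << 24)
    return status


# The reported case: 4096 bytes requested, residue 3070, no sense data.
result = final_result(SAM_STAT_CHECK_CONDITION, (0, 0, 0), 4096, 3070, 4096)
print(hex(result))
```

In this model the bad-sector read never surfaces as a medium error; it only ever appears as a generic "retry me" result, which is what leaves the decision about when to give up to the SCSI layer's retry policy.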

Now, I don't remember exactly what would happen with 2.6.26 under these 
conditions.  Perhaps the SCSI layer would retry the command a few times 
and then give up, but not realize that the read had failed -- meaning 
that whatever garbage was in the buffer would be returned to the user.  

2.6.27 does retry the read, indefinitely as far as I can tell.  At
least, if there is a means for giving up eventually, I don't know what
it is.  My B'' patch provides such a means, but I doubt it will be
accepted since it would interfere with the operation of some SCSI tape
devices.

2.6.28 is slightly better in this regard.  You might say it has been
fixed; it will retry the command until a timeout expires.  However, the
timeout tends to be rather large (30 or 60 seconds multiplied by 6
iterations, typically).  I don't regard this as particularly useful.
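
To put numbers on that, a small sketch of the worst-case wait, assuming the total is simply the per-command timeout multiplied by the iteration count as described above (the function name is hypothetical):

```python
def worst_case_wait(timeout_seconds, iterations=6):
    """Total seconds spent before giving up, under the stated assumption."""
    return timeout_seconds * iterations


for timeout in (30, 60):
    total = worst_case_wait(timeout)
    print(f"{timeout} s timeout: {total} s ({total // 60} minutes) per bad read")
```

That is 3 to 6 minutes for each read that hits the bad block, which adds up quickly when recovering a disk.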

It's fair to say that at present, the SCSI core's retry and timeout
policy is pretty messed up.  However, I'm not a good person to ask about
getting the problem fixed, because I'm not an expert SCSI developer.

In fact, the best thing would be for you to push item 1 from comment
#19 in the Gentoo bug report upstream.  That would focus the attention
of the SCSI developers and give them something concrete to work on and
to test with.  If that's what you do, add me to the CC list.

Alan Stern
