Re: [Bug 13594] New: SMART responses for SATA disks on SAS get interpreted as errors

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Sun, 21 Jun 2009 13:47:51 -0500

On Sun, 2009-06-21 at 17:26 +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13594
> 
>            Summary: SMART responses for SATA disks on SAS get interpreted
>                     as errors
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 2.6.30-rc6
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: SCSI
>         AssignedTo: linux-scsi@xxxxxxxxxxxxxxx
>         ReportedBy: sgunderson@xxxxxxxxxxx
>         Regression: No
> 
> 
> Hi,
> 
> I just bought an LSI SAS3081E-R which I use against a Supermicro backplane to
> drive ten Seagate SATA disks (7200.11, 750GB and 1.5TB). I'm using the
> standard Linux Fusion MPT device driver (CONFIG_FUSION_SAS) under Linux
> 2.6.30-rc6. Everything seems to work pretty well, with one exception: When I
> use SMART against the drives (say, smartctl -a /dev/sda) the kernel complains
> with:
> 
>   [  811.091916] sd 0:0:0:0: [sda] Sense Key : Recovered Error [current] [descriptor]
>   [  811.099807] Descriptor sense data with sense descriptors (in hex):
>   [  811.106175]         72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
>   [  811.113262]         00 4f 00 c2 00 50
>   [  811.117379] sd 0:0:0:0: [sda] Add. Sense: ATA pass through information available

This is a message the kernel prints on every recovered error return
(except those marked REQ_QUIET).  It's purely informational and doesn't
affect return processing of the command at all, so the kernel is
actually treating this as a successful completion, not as an error.
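
To see what the quoted sense bytes actually carry, they can be decoded by
hand.  The following is only a minimal sketch in Python, assuming the SPC
descriptor-format sense layout and the SAT "ATA Status Return" descriptor
(code 09h); the function and field names are made up for illustration and
are not taken from the kernel or from smartmontools.

```python
# Sense bytes copied from the log quoted in the bug report.
SENSE_HEX = (
    "72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 "
    "00 4f 00 c2 00 50"
)


def decode_descriptor_sense(sense: bytes) -> dict:
    """Decode descriptor-format sense data (hypothetical helper).

    Only the ATA Status Return descriptor (code 0x09) is interpreted;
    any other descriptor is skipped.
    """
    if sense[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")

    result = {
        "sense_key": sense[1] & 0x0F,  # 0x1 = Recovered Error
        "asc": sense[2],
        "ascq": sense[3],
    }

    end = min(8 + sense[7], len(sense))  # byte 7: additional sense length
    pos = 8
    while pos + 2 <= end:
        code, length = sense[pos], sense[pos + 1]
        desc = sense[pos:pos + 2 + length]
        if code == 0x09 and len(desc) >= 14:
            # ATA register values as returned by the device.
            result["ata"] = {
                "extend": desc[2] & 0x01,
                "error": desc[3],
                "count": desc[5],
                "lba_low": desc[7],
                "lba_mid": desc[9],
                "lba_high": desc[11],
                "device": desc[12],
                "status": desc[13],
            }
        pos += 2 + length
    return result


decoded = decode_descriptor_sense(bytes.fromhex(SENSE_HEX))
print(decoded)
```

Under those assumptions the dump decodes to sense key 1h, ASC/ASCQ 00h/1Dh,
ATA error register 00h and status 50h (ERR bit clear).  LBA mid/high of
4Fh/C2h is consistent with a SMART RETURN STATUS reply reporting that no
threshold was exceeded, i.e. the command itself succeeded.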

> I've tried upgrading to the newest firmware (1.28.02.00, 05-MAY-2009), but
> all that changed is that the hex dump was added to the error message.
> 
> Whenever this happens, it appears like all the disks “hiccup” and the kernel
> loses contact with the controller for a small while. If too many of these
> happen at once, eventually disks start falling off RAIDs, and the entire
> machine goes down. It looks to me as if these messages should simply not be
> treated as errors by the kernel -- smartctl explicitly asks for a response even
> if the command doesn't fail (by setting CK_COND), so the response probably
> shouldn't be taken as an error.

So this sounds like the real bug.  However, for the LSI card, it will be
in the SAT layer in the Fusion firmware, not in the kernel.  I can shut
the kernel up by making the recovered error processing clause look for
01/00/1D (sense key/ASC/ASCQ) as well as REQ_QUIET, but that won't affect
the hiccups you are seeing.
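
The check described above could look roughly like this.  It is a sketch of
the decision only, written in Python for brevity rather than the kernel's C;
the function name, constants and the boolean standing in for REQ_QUIET are
hypothetical and do not correspond to actual kernel code.

```python
RECOVERED_ERROR = 0x01           # sense key
ATA_PASS_THROUGH_INFO = (0x00, 0x1D)  # ASC/ASCQ pair from the log above


def should_print_recovered_error(sense_key: int, asc: int, ascq: int,
                                 req_quiet: bool) -> bool:
    """Return True if the informational sense message should be printed.

    Hypothetical predicate: stay silent when the request was marked
    quiet, or when the sense data is just the 01/00/1D "ATA pass through
    information available" reply that CK_COND asked for.
    """
    if sense_key != RECOVERED_ERROR:
        return False  # this clause only handles recovered errors
    if req_quiet:
        return False
    if (asc, ascq) == ATA_PASS_THROUGH_INFO:
        return False
    return True


# The sense data from the report would no longer be logged.
print(should_print_recovered_error(0x01, 0x00, 0x1D, req_quiet=False))
```

Either way the command is still completed as successful; the predicate only
decides whether anything is printed.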

James
