Re: libata-scsi: ata_to_sense_error handling status 0x40

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Wed, 31 Aug 2022 10:40:17 +0900

On 2022/08/30 16:02, Peter Fröhlich wrote:
> On Tue, Aug 30, 2022 at 1:26 AM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>> On Mon, 2022-08-29 at 08:04 +0200, Peter Fröhlich wrote:
>>> That's the sense_table, I was referring to the stat_table. That table
>>> is consulted when we fail to convert via the sense_table.
>> ...
>> So looking at the right code again, this is all very strange. E.g. the
>> ACS specs define bit 5 of the status field as the "device fault" bit,
>> but the code looks at 0x08, so bit 3. For write commands, the definition
>> is:
>>
>> STATUS
>> Bit Description
>> 7:6 Transport Dependent – See 6.2.11
>> 5 DEVICE FAULT bit – See 6.2.6
>> 4 N/A
>> 3 Transport Dependent – See 6.2.11
>> 2 N/A
>> 1 SENSE DATA AVAILABLE bit – See 6.2.9
>> 0 ERROR bit – See 6.2.8
>>
>> And the code is:
>>
>>         static const unsigned char stat_table[][4] = {
>>                 /* Must be first because BUSY means no other bits valid */
>>                 {0x80,          ABORTED_COMMAND, 0x47, 0x00},
>>                 // Busy, fake parity for now
>>                 {0x40,          ILLEGAL_REQUEST, 0x21, 0x04},
>>                 // Device ready, unaligned write command
>>                 {0x20,          HARDWARE_ERROR,  0x44, 0x00},
>>                 // Device fault, internal target failure
>>                 {0x08,          ABORTED_COMMAND, 0x47, 0x00},
>>                 // Timed out in xfer, fake parity for now
>>                 {0x04,          RECOVERED_ERROR, 0x11, 0x00},
>>                 // Recovered ECC error    Medium error, recovered
>>                 {0xFF, 0xFF, 0xFF, 0xFF}, // END mark
>>         };
>>
>> So this does not match at all. Something wrong here, or, the "status"
>> field being observed here is not the one I am thinking of. Checking
>> AHCI & SATA-IO specs, I do not see anything matching there either.
> 
> Thank you for confirming that this section *is* confusing. I was down
> the same rabbit-hole checking "status" in whatever ATA spec I could
> get my hands on, and it didn't help. Specifically for "WRITE DMA"
> where I usually see the error, it seems that bit 6 has no other
> meaning than "transport dependent" which for SATA means (I believe)
> "drive ready" as it's always been. But that 0x40 is *not* an
> "unaligned write" whatever else it may be. My suspicion is that maybe
> it went in by accident since it's also in a "whitespace" commit. On
> the other hand, it has an explicit comment. I wasn't going to bother
> the original author before, but maybe now I should?

+Hannes

Except for bit 0x20 (device fault), the other bits do not match anything
sensible either. So I wonder what specs this is against. Hannes ? 7-year-old
patch... I am sure your memory is very fresh about this one :)
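
To make the effect concrete, here is a minimal standalone sketch of a
first-match lookup over the table quoted above. This is not the kernel code:
the helper name stat_to_sense() is made up for illustration, the sense key
values are the standard SPC ones, and 0x51 is only an assumed example of a
status seen on a failed command (bit 6 set together with the ERROR bit).

```c
#include <assert.h>
#include <stdint.h>

/* Standard SCSI sense key values (SPC). */
#define RECOVERED_ERROR 0x01
#define HARDWARE_ERROR  0x04
#define ILLEGAL_REQUEST 0x05
#define ABORTED_COMMAND 0x0b

/* Copy of the quoted table: { status mask, sense key, ASC, ASCQ }. */
static const unsigned char stat_table[][4] = {
	/* Must be first because BUSY means no other bits valid */
	{0x80, ABORTED_COMMAND, 0x47, 0x00},
	{0x40, ILLEGAL_REQUEST, 0x21, 0x04},
	{0x20, HARDWARE_ERROR,  0x44, 0x00},
	{0x08, ABORTED_COMMAND, 0x47, 0x00},
	{0x04, RECOVERED_ERROR, 0x11, 0x00},
	{0xFF, 0xFF, 0xFF, 0xFF}, /* END mark */
};

/*
 * Hypothetical helper: walk the table in order and stop at the first
 * entry whose mask shares a bit with the status byte.
 * Returns 1 and fills key/asc/ascq on a match, 0 otherwise.
 */
static int stat_to_sense(uint8_t status, uint8_t *key, uint8_t *asc,
			 uint8_t *ascq)
{
	assert(key && asc && ascq);

	for (int i = 0; stat_table[i][0] != 0xFF; i++) {
		if (stat_table[i][0] & status) {
			*key = stat_table[i][1];
			*asc = stat_table[i][2];
			*ascq = stat_table[i][3];
			return 1;
		}
	}
	return 0;
}
```

With such a lookup, any status that has bit 6 set and bit 7 clear lands on the
0x40 entry before anything else is considered, so stat_to_sense(0x51, ...)
gives ILLEGAL_REQUEST with ASC/ASCQ 0x21/0x04, i.e. "unaligned write command",
regardless of what the command actually was.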

>>> Which is why I am pretty sure that the "unaligned write" message is
>>> spurious since I am writing to a plain old SSD. It's going to be hard
>>> for a userspace program to generate a write that is not properly
>>> aligned for the SSD.
>>
>> Unless your SSD is really buggy and throws strange errors, which is
>> always a possibility. Do you have a good reproducer of the problem ?
> 
> Not really, sadly. For me it happens with SSDs behind a Marvell SATA
> controller but it doesn't happen when the same SSD goes behind a
> fancier SAS controller. This is what led me into the ATA/SCSI layer as
> the possible culprit because on the SAS boxes that layer is not used.

Yes, with a SAS HBA that has SAT implemented in FW, the HBA FW will do the
conversion to sense data for failed commands. No way of knowing how that is done
there.

> BTW there's another "strange" effect that sometimes seems to lose the
> LBA flag on the ATA taskfile struct resulting in an obscure error
> message about failed CHS addressing. In that case I suspect an
> initialization gone wrong or maybe a race condition somewhere, but
> it's been a real pain to track down further. If I ever get a better
> handle on how to repro this stuff, I certainly will share.

Yes, that type of error generally means something goes badly during scanning or
revalidate, e.g. access to a log page failing. That is a fairly common problem
on many drives (e.g. drives advertising support for READ LOG DMA EXT but that
command in fact not working). Your drive may need some quirks to get a reliable
scan.

Have you checked if your drive already has some entry in ata_device_blacklist
(in libata-core.c) ?

> 
> Cheers,
> Peter

-- 
Damien Le Moal
Western Digital Research