Re: libata-scsi: ata_to_sense_error handling status 0x40

Peter Fröhlich <peter.hans.froehlich@xxxxxxxxx> · Tue, 30 Aug 2022 09:02:16 +0200

On Tue, Aug 30, 2022 at 1:26 AM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
> On Mon, 2022-08-29 at 08:04 +0200, Peter Fröhlich wrote:
> > That's the sense_table, I was referring to the stat_table. That table
> > is consulted when we fail to convert via the sense_table.
> ...
> So looking at the right code again, this is all very strange. E.g. the
> ACS specs define bit 5 of the status field as the "device fault" bit,
> but the code looks at 0x08, so bit 3. For write command, the definition
> is:
>
> STATUS
> Bit Description
> 7:6 Transport Dependent – See 6.2.11
> 5 DEVICE FAULT bit – See 6.2.6
> 4 N/A
> 3 Transport Dependent – See 6.2.11
> 2 N/A
> 1 SENSE DATA AVAILABLE bit – See 6.2.9
> 0 ERROR bit – See 6.2.8
>
> And the code is:
>
>         static const unsigned char stat_table[][4] = {
>                 /* Must be first because BUSY means no other bits valid */
>                 {0x80,          ABORTED_COMMAND, 0x47, 0x00},
>                 // Busy, fake parity for now
>                 {0x40,          ILLEGAL_REQUEST, 0x21, 0x04},
>                 // Device ready, unaligned write command
>                 {0x20,          HARDWARE_ERROR,  0x44, 0x00},
>                 // Device fault, internal target failure
>                 {0x08,          ABORTED_COMMAND, 0x47, 0x00},
>                 // Timed out in xfer, fake parity for now
>                 {0x04,          RECOVERED_ERROR, 0x11, 0x00},
>                 // Recovered ECC error    Medium error, recovered
>                 {0xFF, 0xFF, 0xFF, 0xFF}, // END mark
>         };
>
> So this does not match at all. Something wrong here, or, the "status"
> field being observed here is not the one I am thinking of. Checking
> AHCI & SATA-IO specs, I do not see anything matching there either.

Thank you for confirming that this section *is* confusing. I was down
the same rabbit-hole checking "status" in whatever ATA spec I could
get my hands on, and it didn't help. Specifically for "WRITE DMA"
where I usually see the error, it seems that bit 6 has no other
meaning than "transport dependent" which for SATA means (I believe)
"drive ready" as it's always been. But that 0x40 is *not* an
"unaligned write" whatever else it may be. My suspicion is that maybe
it went in by accident since it's also in a "whitespace" commit. On
the other hand, it has an explicit comment. I wasn't going to bother
the original author before, but maybe now I should?
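
To make the concern concrete, here is a minimal sketch of how I read the
fallback: the table is scanned top to bottom and the first entry whose
mask overlaps the status byte wins. The table is copied from the quote
above; the helper name stat_lookup() and the numeric sense key values are
my own stand-ins for illustration, not the kernel's code, and I am
assuming the scan really is a first-match bitwise AND.

```c
#include <stddef.h>

/* SCSI sense key values (standard numbering; defined here only so the
 * sketch is self-contained). */
enum {
    RECOVERED_ERROR = 0x01,
    HARDWARE_ERROR  = 0x04,
    ILLEGAL_REQUEST = 0x05,
    ABORTED_COMMAND = 0x0b,
};

/* Copy of the quoted stat_table: {status mask, sense key, ASC, ASCQ}. */
static const unsigned char stat_table[][4] = {
    {0x80, ABORTED_COMMAND, 0x47, 0x00}, /* busy */
    {0x40, ILLEGAL_REQUEST, 0x21, 0x04}, /* device ready -> "unaligned write" */
    {0x20, HARDWARE_ERROR,  0x44, 0x00}, /* device fault */
    {0x08, ABORTED_COMMAND, 0x47, 0x00},
    {0x04, RECOVERED_ERROR, 0x11, 0x00},
    {0xFF, 0xFF, 0xFF, 0xFF},            /* END mark */
};

/* Hypothetical helper: return the first table row whose mask shares a
 * bit with drv_stat, or NULL if no row matches. */
static const unsigned char *stat_lookup(unsigned char drv_stat)
{
    for (size_t i = 0; stat_table[i][0] != 0xFF; i++) {
        if (stat_table[i][0] & drv_stat)
            return stat_table[i];
    }
    return NULL;
}
```

If that reading is right, any non-busy status with bit 6 set lands on
the 0x40 row: stat_lookup(0x51) yields ILLEGAL_REQUEST with 0x21/0x04,
and even stat_lookup(0x61), which has the device fault bit set, never
reaches the 0x20 row because 0x40 is checked first.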

> > Which is why I am pretty sure that the "unaligned write" message is
> > spurious since I am writing to a plain old SSD. It's going to be hard
> > for a userspace program to generate a write that is not properly
> > aligned for the SSD.
>
> Unless your SSD is really buggy and throws strange errors, which is
> always a possibility. Do you have a good reproducer of the problem?

Not really, sadly. For me it happens with SSDs behind a Marvell SATA
controller but it doesn't happen when the same SSD goes behind a
fancier SAS controller. This is what led me into the ATA/SCSI layer as
the possible culprit because on the SAS boxes that layer is not used.
BTW there's another "strange" effect that sometimes seems to lose the
LBA flag on the ATA taskfile struct resulting in an obscure error
message about failed CHS addressing. In that case I suspect an
initialization gone wrong or maybe a race condition somewhere, but
it's been a real pain to track down further. If I ever get a better
handle on how to repro this stuff, I certainly will share.

Cheers,
Peter