Re: libata-scsi: ata_to_sense_error handling status 0x40

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Fri, 2 Sep 2022 11:35:16 +0900

On 9/1/22 15:13, Hannes Reinecke wrote:
> On 9/1/22 00:54, Damien Le Moal wrote:
>> On 8/31/22 22:30, Peter Fröhlich wrote:
>>> Sorry for spamming replies and quoting myself.
>>>
>>> On Wed, Aug 31, 2022 at 12:21 PM Peter Fröhlich
>>> <peter.hans.froehlich@xxxxxxxxx> wrote:
>>>> On Wed, Aug 31, 2022 at 9:48 AM Damien Le Moal
>>>> <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
>>>>> On 8/31/22 16:15, Hannes Reinecke wrote:
>>>>>> Oh, of course :-)
>>>>>> That was when doing SMR support for libata.
>>>>>> I dimly remember that some pre-spec drives had been using the DRDY bit
>>>>>> to signal an unaligned write. Which never made it into the spec, but the
>>>>>> decoding stayed.
>>>>>
>>>>> Any idea where the other bits come from ? Except for bit 5 (device fault),
>>>>> I do not see anything else in the specs that mandate these definitions...
>>>>
>>>> I have since discovered the "SCSI to ATA" specification which has two
>>>> tables about mapping ATA errors to SCSI errors. Among those I was able
>>>> to find an "unaligned write" case as well, but I cannot properly parse
>>>> the rest of the two tables yet. They are in sections 11.6 and 11.7 of
>>>> that document.
>>>
>>> So I've re-read everything I can get my hands on and from what I can
>>> tell the overall "flow" of ata_to_sense_error() is not what the
>>> specifications would imply. For example we look at BSY on entry and
>>> then say "ah, it's set, then let's ignore the error field" when the
>>> specification (the way I read it) instead says "BSY is transport
>>> dependent, so we say nothing about it here; but check the error bit in
>>> status, if it is set, interpret the error field, otherwise there's
>>> nothing for you in the error field". Of course I am a complete noob
>>> when it comes to this ATA/SATA/SCSI/AHCI stuff, so please divide by at
>>> least two. Sorry if this adds more confusion on top.
>>
>> I had a quick look at the specs again. I already spotted an error: when
>> the status device fault bit is set, the sense should be HARDWARE ERROR /
>> INTERNAL TARGET FAILURE and not ABORTED COMMAND / 0x47 like now. That is
>> according to SAT-5. But looking at ACS-5, sections 6 and 7.1.6, there are
>> *a lot* of cases that need to be taken care of. It looks like the
>> sense_table does that, but need to cross check.
>>
>> As for the stat_table, except for the first buggy entry as mentioned
>> above, I have no clue where these come from. SAT only defines the HARDWARE
>> ERROR / INTERNAL TARGET FAILURE for when the status field device fault bit
>> is set. Need to dig further, but I am afraid this code may be due to years
>> of supporting drives returning weird errors that got mapped to sensible
>> sense codes instead of a pure implementation of the specs...
>>
>> I need to spend some quality time with ACS and SAT documents to sort out
>> this one... And lots of coffee too probably :)
>>
> And, to make matters ever more complicated, the error and status bits 
> changed over time. And even the SAT translation changed between versions.

I checked all the SAT documents from v1 up to v5, and all of them define
the same translation for the device fault status bit.

1) If the status device fault bit is set, ignore the error field and
translate to HARDWARE ERROR / INTERNAL TARGET FAILURE. The current code
gets this wrong: it returns ABORTED COMMAND / SCSI PARITY ERROR, which is
a little silly. We could fix this, but there seems to be no urgency since
no one has complained about it. I suspect this is because this stuff only
matters for IDE drives, since most NCQ drives will get sense data from
read log 10h anyway. So that weird stat_table is likely never used, even
for IDE drives, as the sense_table gets a hit first all the time.

2) When the status device fault bit is not set and the error bit is set,
the error bits are defined differently across SAT revisions, but the
definitions all look backward compatible.

3) None of the SAT specs define anything about an "unaligned" error. I
think it is safe to remove that one as a fix. Will you send a patch ?
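
To make the flow in (1) to (3) concrete, here is a minimal userspace
sketch in C. It is not the libata code, and the names classify_status(),
device_fault_sense(), enum xlat_path and struct sense are made up for
illustration. The status bit values are the same as ATA_DF and ATA_ERR in
<linux/ata.h>, and the sense key / ASC values are the standard ones for
HARDWARE ERROR / INTERNAL TARGET FAILURE. The per-revision decoding of the
error field is deliberately left out.

```c
#include <stdint.h>

/* ATA status register bits (same values as ATA_DF and ATA_ERR in <linux/ata.h>). */
#define ST_DF   0x20    /* device fault */
#define ST_ERR  0x01    /* error bit: the error field is meaningful */

/* SCSI sense values for HARDWARE ERROR / INTERNAL TARGET FAILURE. */
#define SK_HARDWARE_ERROR            0x04
#define ASC_INTERNAL_TARGET_FAILURE  0x44

/* Hypothetical names, for illustration only. */
enum xlat_path {
	XLAT_NONE,          /* neither DF nor ERR set: nothing to translate */
	XLAT_DEVICE_FAULT,  /* DF set: fixed sense data, error field ignored */
	XLAT_ERROR_FIELD,   /* DF clear, ERR set: decode the error field */
};

struct sense {
	uint8_t sk, asc, ascq;
};

/* Pick the translation path from the status field alone. */
static enum xlat_path classify_status(uint8_t drv_stat)
{
	if (drv_stat & ST_DF)
		return XLAT_DEVICE_FAULT;
	if (drv_stat & ST_ERR)
		return XLAT_ERROR_FIELD;
	return XLAT_NONE;
}

/* Sense data for the device fault case. */
static struct sense device_fault_sense(void)
{
	struct sense s = { SK_HARDWARE_ERROR, ASC_INTERNAL_TARGET_FAILURE, 0x00 };
	return s;
}
```

With this flow, a status of 0x40 (DRDY only) ends up as XLAT_NONE, that
is, there is no special case for DRDY and no "unaligned" translation.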

Peter,

Your drive seems to be an exception to my (1) statement and the error it
returns seems weird enough that the stat_table ends up being used.
Could you send the dmesg output of a failed command so that we can see the
err_mask etc info for the failed command ? It would also be good to add a
print of the drv_stat and drv_err parameters passed to
ata_to_sense_error() for the failures you are seeing. That would help
figure out what your drive is attempting to signal.
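
If it helps when reading the raw drv_stat value, below is a small
userspace C sketch that spells out which status bits are set. The function
name format_status() and the output format are my own assumptions, not
something libata provides. The bit positions follow the ATA_* status
definitions in <linux/ata.h> (bit 7 BSY down to bit 0 ERR); keep in mind
that the meaning of some of these bits changed between ATA revisions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Status bit names indexed by bit number (bit 0 = ERR ... bit 7 = BSY). */
static const char *const status_bit_names[8] = {
	"ERR", "SENSE", "CORR", "DRQ", "DSC", "DF", "DRDY", "BSY",
};

/*
 * Hypothetical helper: write drv_stat as hex followed by the names of the
 * bits that are set, most significant bit first. Output is truncated if
 * buf is too small.
 */
static void format_status(uint8_t drv_stat, char *buf, size_t len)
{
	int n = snprintf(buf, len, "0x%02x {", drv_stat);

	for (int bit = 7; bit >= 0 && n >= 0 && (size_t)n < len; bit--) {
		if (drv_stat & (1u << bit))
			n += snprintf(buf + n, len - n, " %s",
				      status_bit_names[bit]);
	}
	if (n >= 0 && (size_t)n < len)
		snprintf(buf + n, len - n, " }");
}
```

For the status 0x40 in the subject line, this gives "0x40 { DRDY }".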

Also, please send the output of "hdparm -I" for that SSD, so that we have
information about what standard it is (supposedly) following.

> So there really is no clear "that's the way to go" style of thing; if we 
> want to be correct we would need to evaluate the ATA version for that 
> device, and have different translation tables depending on the version.
> 
> Not sure if it's worth it, though; in the end it's just an error 
> description which will get changed. Commands will be aborted in either 
> case, so the net result is close to zero :-)
> 
> Cheers,
> 
> Hannes

-- 
Damien Le Moal
Western Digital Research