Re: libata-scsi: ata_to_sense_error handling status 0x40

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Mon, 29 Aug 2022 23:26:38 +0000

On Mon, 2022-08-29 at 08:04 +0200, Peter Fröhlich wrote:
> On Mon, Aug 29, 2022 at 1:20 AM Damien Le Moal
> <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
> > On 2022/08/26 21:00, Peter Fröhlich wrote:
> > > In commit 8ae720449fca4b1d0294c0a0204c0c45556a3e61 "libata:
> > > whitespace
> > > fixes in ata_to_sense_error()" we find, among many actual
> > > whitespace
> > > changes, another change that adds an entry for 0x40 to the
> > > stat_table.
> > > ...
> > 
> > See the code below that table definition. You can see this hunk:
> > 
> >         if (drv_err) {
> >                 /* Look for drv_err */
> >                 for (i = 0; sense_table[i][0] != 0xFF; i++) {
> >                         /* Look for best matches first */
> >                         if ((sense_table[i][0] & drv_err) ==
> >                             sense_table[i][0]) {
> >                                 *sk = sense_table[i][1];
> >                                 *asc = sense_table[i][2];
> >                                 *ascq = sense_table[i][3];
> >                                 goto translate_done;
> >                         }
> >                 }
> >         }
> 
> That's the sense_table, I was referring to the stat_table. That table
> is consulted when we fail to convert via the sense_table.

Right... Reading emails on Monday morning without enough coffee was a
bad idea :)

So looking at the right code again, this is all very strange. E.g. the
ACS specs define bit 5 of the status field as the "device fault" bit,
but the code looks at 0x08, so bit 3. For write command, the definition
is:

STATUS
Bit Description
7:6 Transport Dependent – See 6.2.11
5 DEVICE FAULT bit – See 6.2.6
4 N/A
3 Transport Dependent – See 6.2.11
2 N/A
1 SENSE DATA AVAILABLE bit – See 6.2.9
0 ERROR bit – See 6.2.8

And the code is:

        static const unsigned char stat_table[][4] = {                
                /* Must be first because BUSY means no other bits valid
*/      
                {0x80,          ABORTED_COMMAND, 0x47, 0x00},         
                // Busy, fake parity for now                          
                {0x40,          ILLEGAL_REQUEST, 0x21, 0x04},         
                // Device ready, unaligned write command              
                {0x20,          HARDWARE_ERROR,  0x44, 0x00},         
                // Device fault, internal target failure              
                {0x08,          ABORTED_COMMAND, 0x47, 0x00},         
                // Timed out in xfer, fake parity for now             
                {0x04,          RECOVERED_ERROR, 0x11, 0x00},         
                // Recovered ECC error    Medium error, recovered     
                {0xFF, 0xFF, 0xFF, 0xFF}, // END mark                 
        };

So this does not match at all. Something wrong here, or, the "status"
field being observed here is not the one I am thinking of. Checking
AHCI & SATA-IO specs, I do not see anything matching there either.

Hu... Need to dig into this further. Missing something here.

> 
> > Are you using an SMR drive ? Getting an aligned write should not
> > happen for a
> > regular disk as the kernel ensure alignments of IO to LBAs. But for
> > SMR, it is
> > possible to send a write command that is not aligned to a zone
> > write pointer
> > position, resulting in an unaligned write error.
> 
> Which is why I am pretty sure that the "unaligned write" message is
> spurious since I am writing to a plain old SSD. It's going to be hard
> for a userspace program to generate a write that is no properly
> aligned for the SSD.

Unless your SSD is really buggy and throws strange errors, which is
always a possibility. Do you have a good reproducer of the problem ?

> 
> Cheers,
> Peter