On 2022/08/30 16:02, Peter Fröhlich wrote:
> On Tue, Aug 30, 2022 at 1:26 AM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>> On Mon, 2022-08-29 at 08:04 +0200, Peter Fröhlich wrote:
>>> That's the sense_table, I was referring to the stat_table. That table
>>> is consulted when we fail to convert via the sense_table.
>> ...
>> So looking at the right code again, this is all very strange. E.g. the
>> ACS specs define bit 5 of the status field as the "device fault" bit,
>> but the code looks at 0x08, so bit 3. For write command, the definition
>> is:
>>
>> STATUS
>> Bit  Description
>> 7:6  Transport Dependent – See 6.2.11
>> 5    DEVICE FAULT bit – See 6.2.6
>> 4    N/A
>> 3    Transport Dependent – See 6.2.11
>> 2    N/A
>> 1    SENSE DATA AVAILABLE bit – See 6.2.9
>> 0    ERROR bit – See 6.2.8
>>
>> And the code is:
>>
>> static const unsigned char stat_table[][4] = {
>>         /* Must be first because BUSY means no other bits valid */
>>         {0x80, ABORTED_COMMAND, 0x47, 0x00},
>>         // Busy, fake parity for now
>>         {0x40, ILLEGAL_REQUEST, 0x21, 0x04},
>>         // Device ready, unaligned write command
>>         {0x20, HARDWARE_ERROR, 0x44, 0x00},
>>         // Device fault, internal target failure
>>         {0x08, ABORTED_COMMAND, 0x47, 0x00},
>>         // Timed out in xfer, fake parity for now
>>         {0x04, RECOVERED_ERROR, 0x11, 0x00},
>>         // Recovered ECC error Medium error, recovered
>>         {0xFF, 0xFF, 0xFF, 0xFF}, // END mark
>> };
>>
>> So this does not match at all. Something wrong here, or, the "status"
>> field being observed here is not the one I am thinking of. Checking
>> AHCI & SATA-IO specs, I do not see anything matching there either.
>
> Thank you for confirming that this section *is* confusing. I was down
> the same rabbit-hole checking "status" in whatever ATA spec I could
> get my hands on, and it didn't help. Specifically for "WRITE DMA"
> where I usually see the error, it seems that bit 6 has no other
> meaning than "transport dependent" which for SATA means (I believe)
> "drive ready" as it's always been.
> But that 0x40 is *not* an
> "unaligned write" whatever else it may be. My suspicion is that maybe
> it went in by accident since it's also in a "whitespace" commit. On
> the other hand, it has an explicit comment. I wasn't going to bother
> the original author before, but maybe now I should?

+Hannes

Except for bit 0x20 (device fault), the other bits do not match anything
sensible either. So I wonder what specs this is against.

Hannes ? 7-year-old patch... I am sure your memory is very fresh about this
one :)

>>> Which is why I am pretty sure that the "unaligned write" message is
>>> spurious since I am writing to a plain old SSD. It's going to be hard
>>> for a userspace program to generate a write that is not properly
>>> aligned for the SSD.
>>
>> Unless your SSD is really buggy and throws strange errors, which is
>> always a possibility. Do you have a good reproducer of the problem ?
>
> Not really, sadly. For me it happens with SSDs behind a Marvell SATA
> controller but it doesn't happen when the same SSD goes behind a
> fancier SAS controller. This is what led me into the ATA/SCSI layer as
> the possible culprit because on the SAS boxes that layer is not used.

Yes, with a SAS HBA that has SAT implemented in FW, the HBA FW will do the
conversion to sense data for failed commands. No way of knowing how that is
done there.

> BTW there's another "strange" effect that sometimes seems to lose the
> LBA flag on the ATA taskfile struct resulting in an obscure error
> message about failed CHS addressing. In that case I suspect an
> initialization gone wrong or maybe a race condition somewhere, but
> it's been a real pain to track down further. If I ever get a better
> handle on how to repro this stuff, I certainly will share.

Yes, that type of error generally means something goes badly during scanning
or revalidate, e.g. access to a log page failing. That is a fairly common
problem on many drives (e.g.
drives advertising support for READ LOG DMA EXT but that command in fact not
working). Your drive may need some quirks to get a reliable scan. Have you
checked if your drive already has some entry in ata_device_blacklist (in
libata-core.c) ?

>
> Cheers,
> Peter

-- 
Damien Le Moal
Western Digital Research
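
For reference, a minimal standalone sketch of why the 0x40 entry of the stat_table quoted above could produce the "unaligned write" message so readily. It assumes the table is scanned top to bottom and that the first row whose mask bit is set in the status byte wins; that scan order, the helper name lookup_status and the struct sense_triplet are assumptions made for illustration, not taken from the kernel source. The status value 0x51 (ready bit plus error bit) is likewise only an example of a failed-command status.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard SCSI sense key codes used by the table rows. */
enum {
    RECOVERED_ERROR = 0x01,
    HARDWARE_ERROR  = 0x04,
    ILLEGAL_REQUEST = 0x05,
    ABORTED_COMMAND = 0x0b,
};

/* Hypothetical container for the result: sense key, ASC, ASCQ. */
struct sense_triplet {
    uint8_t sk;
    uint8_t asc;
    uint8_t ascq;
};

/* Rows copied from the table quoted in the thread: mask, key, ASC, ASCQ. */
static const unsigned char stat_table[][4] = {
    {0x80, ABORTED_COMMAND, 0x47, 0x00}, /* Busy */
    {0x40, ILLEGAL_REQUEST, 0x21, 0x04}, /* Device ready -> unaligned write */
    {0x20, HARDWARE_ERROR,  0x44, 0x00}, /* Device fault */
    {0x08, ABORTED_COMMAND, 0x47, 0x00},
    {0x04, RECOVERED_ERROR, 0x11, 0x00},
    {0xFF, 0xFF, 0xFF, 0xFF},            /* END mark */
};

/*
 * Hypothetical helper: first-match scan over stat_table.
 * Returns 1 and fills *out if some row's mask bit is set in status,
 * 0 if no row matches. The scan order is an assumption.
 */
int lookup_status(uint8_t status, struct sense_triplet *out)
{
    for (size_t i = 0; stat_table[i][0] != 0xFF; i++) {
        if (stat_table[i][0] & status) {
            out->sk   = stat_table[i][1];
            out->asc  = stat_table[i][2];
            out->ascq = stat_table[i][3];
            return 1;
        }
    }
    return 0;
}
```

Under these assumptions, lookup_status(0x51, &t) stops at the 0x40 row and yields ILLEGAL_REQUEST with 0x21/0x04, i.e. "unaligned write command", even though the write was perfectly aligned: any status with bit 6 set and bit 7 clear would land there before the later rows are considered.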