Re: libata-scsi: ata_to_sense_error handling status 0x40

Peter Fröhlich <peter.hans.froehlich@xxxxxxxxx> · Mon, 12 Sep 2022 09:52:56 +0200

Apologies, everybody: I dropped the ball on this a little; see below.

On Fri, Sep 2, 2022 at 10:41 AM Damien Le Moal
<damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
> On 9/2/22 15:34, Peter Fröhlich wrote:
> > I don't think the drive wants to "signal" anything, instead it simply
> > "disappears" at some point. The "original" error is "Emask 0x4
> > (timeout)". So here's an example from early on when I had not made
> > many kernel changes yet:
>
> Sounds like the drive FW is crashing...

That, or maybe the interaction between the SATA controller and the
drive not being the most awesome. As I hinted before, we've had this
with different disks, both WD and Samsung disks, but the controller is
the same. Which, BTW, is this one:

02:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11)

As far as I can tell, there are no major ATA quirks in the driver for
that thing, just a general PCI quirk that seems to apply to a bunch of
these Marvell chips.

> Are you running this drive with device/queue_depth set to 1 ? What is
> issuing a WRITE DMA instead of the NCQ equivalent ? Is this a passthrough
> command ?

The NCQ feature is indeed switched off because we've had problems with
other disks (spinning rust IIRC) crashing due to their NCQ
implementation being buggy. That's a different problem and, to my
knowledge, has nothing to do with the stuff here, except that, if I
understand it correctly, we're not using "MULTI" commands without NCQ.

Here, finally, is why I "dropped the ball" on this thread. I played
with kernel command line parameters ON A LARK, and it turns out that
if I say "libata.force=pio4" then, for whatever reason, all these
issues go away: I can no longer reproduce the timeouts or the
attendant "wrong error message" that made me post here originally.
From what I gather (and I may be very wrong here), forcing "pio4"
makes the driver use yet another set of commands, and THOSE commands
no longer seem to confuse either the SATA controller or the disk.
Quick benchmarks showed some loss in speed and we're still trying to
figure out more of the details, but again, the timeouts disappeared.
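
For anyone who wants to double-check which "libata.force" settings a
machine actually booted with, here is a minimal sketch in Python. It
only splits a kernel command line string into (ID, value) pairs; the
function name is made up for illustration, and it does not model the
kernel's rules for how an omitted ID is applied to ports and devices.

```python
def parse_libata_force(cmdline):
    """Return (ident, value) pairs for every libata.force= entry.

    `ident` is the optional PORT[.DEVICE] prefix as a string, or None
    when the entry has no prefix. This is an illustrative helper, not
    a reimplementation of the kernel's own parser.
    """
    prefix = "libata.force="
    entries = []
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        # Several settings may be given, separated by commas.
        for item in token[len(prefix):].split(","):
            if not item:
                continue
            ident, sep, value = item.rpartition(":")
            entries.append((ident if sep else None, value))
    return entries
```

On a running system the input string would typically come from
/proc/cmdline, e.g. parse_libata_force(open("/proc/cmdline").read()).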

I don't like the fact that I am no closer to understanding what is
actually wrong here, but maybe this data point helps someone else
formulate a new theory of what's happening. BTW, despite "pio4", when
you ask the disks what they are using, they keep saying udma5. That's
another thing I don't quite understand, but again, someone more
knowledgeable might.
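
One way to compare what the disks say with what the kernel thinks it
configured might be to read the transfer mode attributes that libata
exposes in sysfs. The sketch below assumes those live under
/sys/class/ata_device/<dev>/ as "pio_mode", "dma_mode" and "xfer_mode";
that path and the helper name are assumptions on my part, so adjust
them if your kernel lays things out differently. It only reports the
kernel's view and does not explain the udma5 answer from the disks.

```python
import os


def read_ata_modes(base="/sys/class/ata_device"):
    """Collect the kernel's per-device transfer mode attributes.

    Returns {device_name: {attribute: value_or_None}}. The sysfs
    location is an assumption; a missing directory yields {}.
    """
    modes = {}
    if not os.path.isdir(base):
        return modes
    for dev in sorted(os.listdir(base)):
        entry = {}
        for attr in ("pio_mode", "dma_mode", "xfer_mode"):
            path = os.path.join(base, dev, attr)
            try:
                with open(path) as fh:
                    entry[attr] = fh.read().strip()
            except OSError:
                # Attribute missing or unreadable on this kernel.
                entry[attr] = None
        modes[dev] = entry
    return modes
```

Printing the result of read_ata_modes() next to whatever the disks
themselves report would at least show whether the two views disagree.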

Cheers,
Peter