Apologies everybody, I dropped the ball on this a little, see below.

On Fri, Sep 2, 2022 at 10:41 AM Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
> On 9/2/22 15:34, Peter Fröhlich wrote:
> > I don't think the drive wants to "signal" anything, instead it simply
> > "disappears" at some point. The "original" error is "Emask 0x4
> > (timeout)". So here's an example from early on when I had not made
> > many kernel changes yet:
>
> Sounds like the drive FW is crashing...

That, or maybe the interaction between the SATA controller and the
drive not being the most awesome. As I hinted before, we've had this
with different disks, both WD and Samsung disks, but the controller is
the same. Which, BTW, is this one:

02:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11)

As far as I can tell, no major ATA quirks in the driver for that
thing, just a general PCI quirk that seems to apply to a bunch of
these Marvell chips.

> Are you running this drive with device/queue_depth set to 1 ? What is
> issuing a WRITE DMA instead of the NCQ equivalent ? Is this a passthrough
> command ?

The NCQ feature is indeed switched off because we've had problems with
other disks (spinning rust IIRC) crashing due to their NCQ
implementation being buggy. That's a different problem and has, to my
knowledge, nothing to do with the stuff here. Except that we're not
using "MULTI" commands without NCQ if I understand it correctly.

Here, finally, is why I "dropped the ball" on this thread. I played
with kernel command line parameters ON A LARK and it turns out that if
I say "libata.force=pio4" then for whatever reason all these issues go
away, I can no longer reproduce the timeouts or the attendant "wrong
error message" that made me post here originally. From what I gather
(and I may be very wrong here) forcing "pio4" makes the driver use yet
another set of commands, and THOSE commands seem to confuse neither
the SATA controller nor the disk anymore.

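For anyone who wants to double-check which "libata.force" entries
actually made it onto the kernel command line, here is a minimal
sketch in Python. It assumes the documented format of the option (a
comma-separated list of "[ID:]VAL" entries, where ID is PORT[.DEVICE])
and only splits the string; it does not validate values or tell you
what the driver ended up doing. The function name parse_libata_force
is made up for this illustration and is not part of any kernel or
libata tooling.

```python
def parse_libata_force(cmdline):
    """Return (id, value) pairs found in libata.force= on a command line.

    `cmdline` is the text of a kernel command line, for example the
    contents of /proc/cmdline. An entry without an ID gets id None.
    """
    prefix = "libata.force="
    entries = []
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        for item in token[len(prefix):].split(","):
            if not item:
                continue  # tolerate stray commas
            # Each item is "[ID:]VAL"; the ID part is optional.
            head, sep, tail = item.partition(":")
            if sep:
                entries.append((head, tail))
            else:
                entries.append((None, head))
    return entries


# Example with a literal string; on a running system one would pass
# the contents of /proc/cmdline instead.
example = parse_libata_force("root=/dev/sda1 libata.force=pio4 quiet")
```

With the command line from this thread, `example` holds a single
entry with no ID and the value "pio4", i.e. the setting applies to all
ports rather than to one specific port or device.
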
Quick benchmarks showed some loss in speed and we're still trying to
figure out more of the details, but again, the timeouts disappeared. I
don't like the fact that I am no closer to understanding what is
actually wrong here, but maybe this data point helps someone else
formulate a new theory of what's happening.

BTW, despite "pio4" when you ask the disks what they are using, they
keep saying udma5. Another thing I don't quite understand, but again,
someone more knowledgeable might.

Cheers,
Peter