Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

John Garry <john.garry@xxxxxxxxxx> · Thu, 6 Oct 2022 17:41:40 +0100

On 06/10/2022 15:45, Niklas Cassel wrote:
I think that it gets frozen when the internal command for read log ext times
out. More below about that timeout.
ata_read_log_page() will first try to read using READ LOG DMA EXT.
If that fails it will retry with READ LOG EXT.

Your log has this:
[  350.257870] ata1.00: qc timeout (cmd 0x47)

So it is definitely ATA_CMD_READ_LOG_DMA_EXT that times out.

On timeout, ata_exec_internal_sg() will freeze the port:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c?h=v6.0#n1577

When ata_read_log_page() retries with the port frozen,
READ LOG EXT will obviously fail (since the port is frozen).

Not sure why READ LOG DMA EXT would time out for you...
Perhaps your drive does not implement this command,
and incorrectly reports supporting this command via
ata_id_has_read_log_dma_ext().

Perhaps you could try booting your kernel with libata.force=nodmalog
on the kernel command line, so that ata_read_log_page() will use
READ LOG EXT on the first try.


I tried that and unfortunately it does not appear to help.

I get this log, which confirms that nodmalog is applied:

[   15.757617] ata1.00: FORCE: horkage modified (nodmalog)

but it then still fails with a timeout:

[  123.094430] ata1.00: qc timeout (cmd 0x2f)
[  123.098637] pm80xx0:: mpi_sata_completion 2293: task null, freeing CCB tag 2
[  123.105711] ata1.00: Read log 0x10 page 0x00 failed, Emask 0x5
[  123.118081] ata1: failed to read log page 10h (errno=-5)
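For anyone reading the Emask values in these logs, here is a minimal sketch of how the mask breaks down into bits. The bit values are copied from what I believe enum ata_completion_errors in include/linux/libata.h looks like in v6.0, so please check them against your tree. emask_decode() is a hypothetical helper for illustration only, not a libata function.

```c
#include <assert.h>
#include <string.h>

/*
 * Assumed to match enum ata_completion_errors in include/linux/libata.h
 * (v6.0); verify against your own tree.
 */
enum {
	AC_ERR_DEV      = 1 << 0, /* device reported error */
	AC_ERR_HSM      = 1 << 1, /* host state machine violation */
	AC_ERR_TIMEOUT  = 1 << 2, /* timeout */
	AC_ERR_MEDIA    = 1 << 3, /* media error */
	AC_ERR_ATA_BUS  = 1 << 4, /* ATA bus error */
	AC_ERR_HOST_BUS = 1 << 5, /* host bus error */
	AC_ERR_SYSTEM   = 1 << 6, /* system error */
	AC_ERR_INVALID  = 1 << 7, /* invalid argument */
	AC_ERR_OTHER    = 1 << 8, /* unknown */
};

/*
 * Hypothetical helper (not part of libata): write the names of the bits
 * set in emask into buf, separated by '|', and return buf.
 */
static const char *emask_decode(unsigned int emask, char *buf, size_t len)
{
	static const char *const names[] = {
		"DEV", "HSM", "TIMEOUT", "MEDIA", "ATA_BUS",
		"HOST_BUS", "SYSTEM", "INVALID", "OTHER",
	};
	size_t i;

	assert(len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(emask & (1u << i)))
			continue;
		/* Append a separator before every name except the first. */
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, names[i], len - strlen(buf) - 1);
	}
	return buf;
}
```

Under those assumed values, Emask 0x5 has both the DEV and TIMEOUT bits set, so it is not equal to AC_ERR_TIMEOUT alone. A check that compares the whole mask behaves differently from one that tests a single bit for such a value.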


Damien, it seems that there is no use in retrying if the port
is frozen/we got a timeout, so perhaps:

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index e74ab6c0f1a0..1aa628332c8e 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2035,7 +2035,8 @@ unsigned int ata_read_log_page(struct ata_device *dev, u8 log,
         if (err_mask) {
                 if (dma) {
                         dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
-                       goto retry;
+                       if (err_mask != AC_ERR_TIMEOUT)
+                               goto retry;
                 }

or:

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index e74ab6c0f1a0..2fa03b7573ac 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2035,7 +2035,8 @@ unsigned int ata_read_log_page(struct ata_device *dev, u8 log,
         if (err_mask) {
                 if (dma) {
                         dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
-                       goto retry;
+                       if (!(dev->link->ap->pflags & ATA_PFLAG_FROZEN))
+                               goto retry;
                 }

would be in order, so that we actually print the real error, instead of a bogus
AC_ERR_SYSTEM (returned by ata_exec_internal_sg()) when the port is frozen.
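To illustrate the difference, here is a toy userspace model of the retry path. It is only a sketch under stated assumptions: struct model_port, model_exec_internal() and model_read_log_page() are made-up stand-ins, not the kernel's structures or functions, and the only behaviour modelled is what is described above (a timeout freezes the port, and a frozen port makes the internal command return AC_ERR_SYSTEM). The guard shown corresponds to the second variant (checking the frozen flag).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the libata bit values; verify against your tree. */
enum { AC_ERR_TIMEOUT = 1 << 2, AC_ERR_SYSTEM = 1 << 6 };

/* Toy stand-in for port and device state; not the kernel's structs. */
struct model_port {
	bool frozen;        /* models ATA_PFLAG_FROZEN */
	bool dma_log_hangs; /* READ LOG DMA EXT never completes */
	bool no_dma_log;    /* models ATA_HORKAGE_NO_DMA_LOG */
};

/*
 * Models an internal command: it is refused with AC_ERR_SYSTEM on a
 * frozen port, and a timeout freezes the port.
 */
static unsigned int model_exec_internal(struct model_port *p, bool dma)
{
	if (p->frozen)
		return AC_ERR_SYSTEM;
	if (dma && p->dma_log_hangs) {
		p->frozen = true;
		return AC_ERR_TIMEOUT;
	}
	return 0;
}

/*
 * Models the retry in ata_read_log_page(). With retry_when_frozen set,
 * the DMA failure always falls back to a second attempt (the behaviour
 * being changed); with it cleared, the retry is skipped once the port
 * is frozen (the proposed guard).
 */
static unsigned int model_read_log_page(struct model_port *p,
					bool retry_when_frozen)
{
	unsigned int err_mask;
	bool dma;

	assert(p);
retry:
	dma = !p->no_dma_log;
	err_mask = model_exec_internal(p, dma);
	if (err_mask && dma) {
		p->no_dma_log = true;
		if (retry_when_frozen || !p->frozen)
			goto retry;
	}
	return err_mask;
}
```

In this model, a hanging READ LOG DMA EXT with the unconditional retry ends up reporting AC_ERR_SYSTEM from the second attempt, while the guarded version reports the original AC_ERR_TIMEOUT.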

ata_do_link_abort() calls ata_eh_set_pending() without activating fast drain:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.0#n989

So I'm not sure why your port is frozen.
(The fast drain timer does freeze the port, but it shouldn't be enabled.)
It might be worthwhile to see who freezes the port in your case.
Might come from the command timeout. John has had many problems with the
pm80xx HBA in his Arm machine from a while back. Likely not a driver issue
but a hw one... No-one seems to be able to recreate the same problem.

We need to try the HBA on our Arm board to see what happens.

Yeah, it just looks to be the longstanding issue of using this card on my
arm64 machine - that is that I get IO timeouts quite regularly. I should
have mentioned that yesterday. This just seems to be a driver issue.
Out of curiosity, which arm64 SoC is this?

HiSilicon hi1620 which contains a custom arm v8 implementation. Note 
that others have also seen the issue with this card on other arm 
implementations.


While it is very unlikely that this is your problem, I've encountered
an issue on an ARM board before, where the PCIe controller was incorrectly
configured in device tree, causing the controller to miss interrupts,
which presented itself to the user as timeouts in the WiFi driver:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97131f85c08e024df49480ed499aae8fb754067f

Unlikely. Indeed, when I was checking this issue some time ago, I found
that not only was there no completion interrupt but also no completion
when I manually examined the completion ring buffer read and write pointers.

Here's where I discussed this issue a bit earlier:
https://lore.kernel.org/linux-scsi/PH0PR11MB511238B8FF7B44C375DDDFADEC519@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Thanks,
John