On 10/6/22 23:45, Niklas Cassel wrote: > On Thu, Oct 06, 2022 at 09:33:23AM +0100, John Garry wrote: >> On 05/10/2022 23:42, Damien Le Moal wrote: >>>> Hello Damien, >>>> >>>> John explained that he got a timeout from EH when reading the log: >>>> [ 350.281581] ata1: failed to read log page 10h (errno=-5) >>>> [ 350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen >>>> >>>> ata_eh_read_log_10h() uses ata_read_log_page(), which will first try to read >>>> the log using READ LOG DMA EXT. If that fails, it will retry using READ LOG EXT. >>>> >>>> Therefore, to see if this is a driver specific bug, I suggested to try to read >>>> the NCQ Command Error log using ATA16 passthrough commands: >>>> >>>> $ sudo sg_sat_read_gplog -d --log=0x10 /dev/sdc >>>> will read the log using READ LOG DMA EXT. >>>> >>>> $ sudo sg_sat_read_gplog --log=0x10 /dev/sdc >>>> will read the log using READ LOG EXT. >> >> Note that I can't get a distro to boot on this system from the HDD for the >> same timeout problem (so no tools easily available). >> >>>> >>>> Neither of these two suggested commands are NCQ commands. >>>> (Neither command is encapsulated in a RECEIVE FPDMA QUEUED, >>>> so I'm not sure what you mean.) >>>> >>>> >>>> Garry, I now see that: >>>> [ 350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen >>>> Your port is frozen. >>>> >>>> ata_read_log_page() calls ata_exec_internal() which calls ata_exec_internal_sg(), >>>> which will simply return an error without sending down the command to the drive, >>>> if the port is frozen. >>>> >>>> Not sure why your port is frozen, mine is obviously not. >> >> I think that it gets frozen when the internal command for read log ext times >> out. More below about that timeout. > > ata_read_log_page() will first try to read using READ LOG DMA EXT. > If that fails it will retry with READ LOG EXT. > > Your log has this: > [ 350.257870] ata1.00: qc timeout (cmd 0x47) > > So it is definitely ATA_CMD_READ_LOG_DMA_EXT that times out. > > On timeout, ata_exec_internal_sg() will freeze the port: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c?h=v6.0#n1577 > > When ata_read_log_page() retries with the port frozen, > READ LOG EXT will obviously fail (since the port is frozen). > > Not sure why READ LOG DMA EXT would timeout for you... > Perhaps your drive does not implement this command, > and incorrectly reports supporting this command via > ata_id_has_read_log_dma_ext(). > > Perhaps you could try boot your kernel with libata.force=nodmalog > on the kernel command line, so that ata_read_log_page() will use > READ LOG EXT on the first try. > > > Damien, it seems that there is no use in retrying if the port > is frozen/we got a timeout, so perhaps: > > diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c > index e74ab6c0f1a0..1aa628332c8e 100644 > --- a/drivers/ata/libata-core.c > +++ b/drivers/ata/libata-core.c > @@ -2035,7 +2035,8 @@ unsigned int ata_read_log_page(struct ata_device *dev, u8 log, > if (err_mask) { > if (dma) { > dev->horkage |= ATA_HORKAGE_NO_DMA_LOG; > - goto retry; > + if (err_mask != AC_ERR_TIMEOUT) > + goto retry; > } > > or: > > diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c > index e74ab6c0f1a0..2fa03b7573ac 100644 > --- a/drivers/ata/libata-core.c > +++ b/drivers/ata/libata-core.c > @@ -2035,7 +2035,8 @@ unsigned int ata_read_log_page(struct ata_device *dev, u8 log, > if (err_mask) { > if (dma) { > dev->horkage |= ATA_HORKAGE_NO_DMA_LOG; > - goto retry; > + if (!(dev->link->ap->pflags & ATA_PFLAG_FROZEN)) > + goto retry; Yes, something like this is needed. Though I would prefer having a little ata_port_frozen() helper for the condition. > } > > would be in order, so that we actually print the real error, instead of a bogus > AC_ERR_SYSTEM (returned by ata_exec_internal_sg()) when the port is frozen. yep. >> >>>> >>>> ata_do_link_abort() calls ata_eh_set_pending() without activating fast drain: >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.0#n989 >>>> >>>> So I'm not sure why your port is frozen. >>>> (The fast drain timer does freeze the port, but it shouldn't be enabled.) >>>> It might be worthwhile to see who freezes the port in your case. >>> Might come from the command timeout. John has had many problems with the >>> pm80xx HBA in his Arm machine from a while back. Likely not a driver issue >>> but a hw one... No-one seems to be able to recreate the same problem. >>> >>> We need to try the HBA on our Arm board to see what happens. >>> >> >> Yeah, it just looks to be the longstanding issue of using this card on my >> arm64 machine - that is that I get IO timeouts quite regularly. I should >> have mentioned that yesterday. This just seems to be a driver issue. > > Out of curiosity, which arm64 SoC is this? > > While it is very unlikely that this is your problem, but I've encountered > an issue on an ARM board before, where the PCIe controller was incorrectly > configured in device tree, causing the controller to miss interrrupts, > which presented itself to the user as timeouts in the WiFi driver: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97131f85c08e024df49480ed499aae8fb754067f > > > Kind regards, > Niklas -- Damien Le Moal Western Digital Research