Re: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests

Niklas Cassel <Niklas.Cassel@xxxxxxx> · Mon, 11 Sep 2023 12:40:26 +0000

On Mon, Sep 11, 2023 at 04:16:55PM +0900, Damien Le Moal wrote:
> On 9/8/23 06:02, Niklas Cassel wrote:
> > On Thu, Sep 07, 2023 at 03:43:19PM +0800, kernel test robot wrote:
> >>
> >>
> >> Hello,
> >>
> >> kernel test robot noticed "kernel_BUG_at_drivers/ata/libata-sff.c" on:
> >>
> >> commit: d3d099d5c2dd38db84abd96df39f9f0828c16b7b ("[PATCH v4] ata: libata-eh: Honor all EH scheduling requests")
> >> url: https://github.com/intel-lab-lkp/linux/commits/linan666-huaweicloud-com/ata-libata-eh-Honor-all-EH-scheduling-requests/20230906-164907
> >> base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 65d6e954e37872fd9afb5ef3fc0481bb3c2f20f4
> >> patch link: https://lore.kernel.org/all/20230906084212.1016634-1-linan666@xxxxxxxxxxxxxxx/
> >> patch subject: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests
> >>
> >> in testcase: boot
> >>
> >> compiler: gcc-12
> >> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > 
> > Unfortunately the problem reported by the kernel test robot is very real.
> > I could reproduce without too much effort in QEMU.
> > 
> > The problem is basically that we cannot simply perform a host_eh_scheduled--;
> > in ata_std_end_eh().
> > 
> > ata_std_end_eh() is called at the end of ata_scsi_port_error_handler(),
> > so it is called once every time ata_scsi_port_error_handler() is called.
> > 
> > However, ata_scsi_port_error_handler() will be called by SCSI EH each
> > time SCSI wakes up.
> > 
> > SCSI EH will sleep as long as:
> > if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
> >                     shost->host_failed != scsi_host_busy(shost)) {
> > 	schedule();
> > 	continue;
> > }
> > 
> > 
> > The methods in libata which we use to trigger EH are:
> > 
> > 1) ata_std_sched_eh(), which calls scsi_schedule_eh(), which does
> > host_eh_scheduled++;
> > 
> > 2) ata_qc_schedule_eh(), which will end up in scsi_timeout,
> > which calls scsi_eh_scmd_add() which does:
> > host_failed++;
> > 
> > 
> > So before this patch, setting host_eh_scheduled = 0; in ata_std_end_eh()
> > makes us say that works because it only negates the EH scheduled by
> > ata_std_sched_eh().
> > 
> > However, if we do host_eh_scheduled--, then if the EH was triggered by
> > ata_qc_schedule_eh(), then host_eh_scheduled will decrease < 0,
> > which will trigger SCSI EH to wake up again :)
> > 
> > We could do something like only decreasing host_eh_scheduled if it is > 0.
> > The QCs added to EH using ata_qc_schedule_eh() will be handled by
> > ata_eh_finish(), which will iterate over all QCs owned by EH, and will
> > either fail or retry each QC. After that scsi_error_handler() has finished
> > the call to eh_strategy_handler() (ata_scsi_error()) it will unconditionally
> > set host_failed to 0:
> > https://github.com/torvalds/linux/blob/v6.5/drivers/scsi/scsi_error.c#L2331-L2337
> > 
> > So something like this on top of the patch in $subject:
> > 
> > diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> > index 2d5ecd68b7e0..9ab12d7f6d9f 100644
> > --- a/drivers/ata/libata-eh.c
> > +++ b/drivers/ata/libata-eh.c
> > @@ -952,7 +952,13 @@ EXPORT_SYMBOL_GPL(ata_std_sched_eh);
> >   */
> >  void ata_std_end_eh(struct ata_port *ap)
> >  {
> > -       ap->scsi_host->host_eh_scheduled--;
> > +       struct Scsi_Host *host = ap->scsi_host;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(host->host_lock, flags);
> > +       if (host->host_eh_scheduled > 0)
> > +               host->host_eh_scheduled--;
> > +       spin_unlock_irqrestore(host->host_lock, flags);
> >  }
> >  EXPORT_SYMBOL(ata_std_end_eh);
> > 
> > 
> > With that incremental patch, I can no longer reproduce the crash reported
> > by the kernel test robot in my QEMU setup.
> 
> I am not confident that playing with host_eh_schedule count is the right
> approach. A better solution may be to change the timing of clearing
> ATA_PFLAG_EH_PENDING. Right now, this is done on entry to
> ata_scsi_port_error_handler(), unconditionally. So ata_eh_reset() should not
> need to clear the flag again. If we remove that, then a new interrupt received
> after ata_eh_thaw() and setting EH_PENDING would be cought by the retry loop in
> ata_scsi_port_error_handler(), which would run again ap->ops->error_handler(ap).

I agree that removing the clearing of ATA_PFLAG_EH_PENDING from ata_eh_reset()
would be a nicer fix, if that can be done without introducing regressions.

Looking at git blame:
https://github.com/torvalds/linux/commit/1e641060c4b564e820abdb6a4c7a603a0d386250

It seems that the patch was added because some controllers "may set spuruious
error conditions during reset."

Looking at the original bug report that resulted in this workaround:
https://marc.info/?l=linux-ide&m=124765325828495&w=2

PxIS is:
[    1.497808] ata3: irq_stat 0x00400040, connection status changed

Which means that PxIS.PCS and PxIS.PRCS was set.

PxIS.PCS:
"Port Connect Change Status (PCS): 1=Change in Current Connect Status. 0=No
change in Current Connect Status. This bit reflects the state of PxSERR.DIAG.X.
This bit is only cleared when PxSERR.DIAG.X is cleared."

PxIS.PRCS:
"PhyRdy Change Status (PRCS): When set to ‘1’ indicates the internal PhyRdy signal
changed state. This bit reflects the state of PxSERR.DIAG.N. To clear this bit,
software must clear PxSERR.DIAG.N to ‘0’."

So it seems that these IRQ bits get set as a result of the COMRESET itself.
Thus the patch in:
https://lore.kernel.org/linux-ide/b991c72c-99da-e4f2-055b-fa8b12e0efc4@xxxxxxxxxx/T/#t
which clears PxIS before the COMRESET), is not relevant to this workaround.

Looking a bit closer at the original bug report, the user is simply complaining
that he is seeing a hard reset after probing:
[    1.521856] ata8: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xf t4
[    1.521908] ata8: irq_stat 0x00400040, connection status changed
[    1.521962] ata8: hard resetting link

But other than that, the HBA and drive works as it should:
https://marc.info/?l=linux-ide&m=124709313323657&w=2

Looking at the AHCI 1.3.1 specification, after the host has issued a COMRESET,
the device will reply will a COMINIT.

"When a COMINIT is received, the PxSSTS.DET field shall be set to 1h.
When the communication negotiation sequence is complete and PhyRdy is true
the PxSSTS.DET field shall be set to 3h."

And according to "10.4.2 Port Reset" in AHCI 1.3.1, after issuing a COMRESET,
software must wait for PxSSTS.DET to be set to 3h.

This is done by sata_link_hardreset(), which calls sata_link_resume(),
which calls sata_link_debounce(), which waits for PxSSTS.DET to be != 0x1,
additionally it will return 0 if PxSSTS.DET == 0x1 and timeout is reached.......
(This is a bit unfortunate, and not according to AHCI spec, as it should
strictly wait for it to be 0x3... perhaps there should at least be a warning
if we return while the value is not 0x3.....)

While calling sata_link_hardreset(), port is frozen/IRQs are masked, so this
shouldn't trigger a PhyRdy IRQ. Additionally thawing the port clears pending
IRQs, so any COMINIT received before PxSSTS.DET gets set to 0x3 should not
trigger an IRQ when thawing the port.

If we assume that sata_link_debounce() is not broken and the reason why
sata_link_debounce() returned was because PxSSTS.DET == 0x3, then the most
logical thing is that this is an unsolicited COMINIT (i.e. we received a
COMINIT after PxSSTS.DET was set to 0x3), see:
"6.2.2.3 Recovery of Unsolicited COMINIT" in AHCI 1.3.1.

The spec claims that we should perform an additional COMRESET when this happens
(when PxIS.PCS is set), which is exactly what libahci does:
https://github.com/torvalds/linux/blob/v6.5/drivers/ata/libahci.c#L1834
https://github.com/torvalds/linux/blob/v6.5/include/linux/libata.h#L1555

Since we clear both PxIS and PxSERR after a reset, before thawing the port,
the only logical thing is that we are receiving an unsolicited COMINIT,
and when receiving an unsolicited COMINIT, the proper action is to perform
an additional hardreset. Since the original author was simply complaining
about reset print in dmesg, and that his system was still working, I suggest
that we remove the clearing of ATA_PFLAG_EH_PENDING from ata_eh_reset(), as
it appears that everything is working as per the spec.

Perhaps we should also add a warning in sata_link_debounce() or
sata_link_resume(), so that we can be confident that PxSSTS.DET is 0x3 before
clearing SError and PxIS. (Because if this is not the case, then the bug is
that libata is continuing/resuming before the link has been fully established.)

Kind regards,
Niklas