Hello Sasha, The upstream commit that you are backporting here: 80cc944eca4f ("ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()") Requires that we have: 737dd811a3db ("ata: libahci: clear pending interrupt status") in stable. This is currently not the case. See my comment in: https://lore.kernel.org/stable/ZRWf7Avtdv3DeqCU@x1-carbon/T/#t Perhaps both: 93c7711494f4 ("ata: ahci: Drop pointless VPRINTK() calls and convert the remaining ones") and 737dd811a3db ("ata: libahci: clear pending interrupt status") Could be backported to stable. That way, 80cc944eca4f ("ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()") should be safe to backport to stable as well. Kind regards, Niklas On Sun, Sep 24, 2023 at 09:20:49AM -0400, Sasha Levin wrote: > From: Niklas Cassel <niklas.cassel@xxxxxxx> > > [ Upstream commit 80cc944eca4f0baa9c381d0706f3160e491437f2 ] > > ata_scsi_port_error_handler() starts off by clearing ATA_PFLAG_EH_PENDING, > before calling ap->ops->error_handler() (without holding the ap->lock). > > If an error IRQ is received while ap->ops->error_handler() is running, > the irq handler will set ATA_PFLAG_EH_PENDING. > > Once ap->ops->error_handler() returns, ata_scsi_port_error_handler() > checks if ATA_PFLAG_EH_PENDING is set, and if it is, another iteration > of ATA EH is performed. > > The problem is that ATA_PFLAG_EH_PENDING is not only cleared by > ata_scsi_port_error_handler(), it is also cleared by ata_eh_reset(). > > ata_eh_reset() is called by ap->ops->error_handler(). This additional > clearing done by ata_eh_reset() breaks the whole retry logic in > ata_scsi_port_error_handler(). Thus, if an error IRQ is received while > ap->ops->error_handler() is running, the port will currently remain > frozen and will never get re-enabled. > > The additional clearing in ata_eh_reset() was introduced in commit > 1e641060c4b5 ("libata: clear eh_info on reset completion"). > > Looking at the original error report: > https://marc.info/?l=linux-ide&m=124765325828495&w=2 > > We can see the following happening: > [ 1.074659] ata3: XXX port freeze > [ 1.074700] ata3: XXX hardresetting link, stopping engine > [ 1.074746] ata3: XXX flipping SControl > > [ 1.411471] ata3: XXX irq_stat=400040 CONN|PHY > [ 1.411475] ata3: XXX port freeze > > [ 1.420049] ata3: XXX starting engine > [ 1.420096] ata3: XXX rc=0, class=1 > [ 1.420142] ata3: XXX clearing IRQs for thawing > [ 1.420188] ata3: XXX port thawed > [ 1.420234] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > > We are not supposed to be able to receive an error IRQ while the port is > frozen (PxIE is set to 0, i.e. all IRQs for the port are disabled). > > AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) states: > "Each bit location can be thought of as reporting a '1' if the virtual > "interrupt line" for that port is indicating it wishes to generate an > interrupt. That is, if a port has one or more interrupt status bit set, > and the enables for those status bits are set, then this bit shall be set." > > Additionally, AHCI state P:ComInit clearly shows that the state machine > will only jump to P:ComInitSetIS (which sets IS.IPS(x) to '1'), if PxIE.PCE > is set to '1'. In our case, PxIE is set to 0, so IS.IPS(x) won't get set. > > So IS.IPS(x) only gets set if PxIS and PxIE is set. > > AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) also states: > "The bits in this register are read/write clear. It is set by the level of > the virtual interrupt line being a set, and cleared by a write of '1' from > the software." > > So if IS.IPS(x) is set, you need to explicitly clear it by writing a 1 to > IS.IPS(x) for that port. > > Since PxIE is cleared, the only way to get an interrupt while the port is > frozen, is if IS.IPS(x) is set, and the only way IS.IPS(x) can be set when > the port is frozen, is if it was set before the port was frozen. > > However, since commit 737dd811a3db ("ata: libahci: clear pending interrupt > status"), we clear both PxIS and IS.IPS(x) after freezing the port, but > before the COMRESET, so the problem that commit 1e641060c4b5 ("libata: > clear eh_info on reset completion") fixed can no longer happen. > > Thus, revert commit 1e641060c4b5 ("libata: clear eh_info on reset > completion"), so that the retry logic in ata_scsi_port_error_handler() > works once again. (The retry logic is still needed, since we can still > get an error IRQ _after_ the port has been thawed, but before > ata_scsi_port_error_handler() takes the ap->lock in order to check > if ATA_PFLAG_EH_PENDING is set.) > > Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx> > Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx> > Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx> > --- > drivers/ata/libata-eh.c | 13 +++---------- > 1 file changed, 3 insertions(+), 10 deletions(-) > > diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c > index cbe9af624a06f..8a789de056807 100644 > --- a/drivers/ata/libata-eh.c > +++ b/drivers/ata/libata-eh.c > @@ -2948,18 +2948,11 @@ int ata_eh_reset(struct ata_link *link, int classify, > postreset(slave, classes); > } > > - /* > - * Some controllers can't be frozen very well and may set spurious > - * error conditions during reset. Clear accumulated error > - * information and re-thaw the port if frozen. As reset is the > - * final recovery action and we cross check link onlineness against > - * device classification later, no hotplug event is lost by this. > - */ > + /* clear cached SError */ > spin_lock_irqsave(link->ap->lock, flags); > - memset(&link->eh_info, 0, sizeof(link->eh_info)); > + link->eh_info.serror = 0; > if (slave) > - memset(&slave->eh_info, 0, sizeof(link->eh_info)); > - ap->pflags &= ~ATA_PFLAG_EH_PENDING; > + slave->eh_info.serror = 0; > spin_unlock_irqrestore(link->ap->lock, flags); > > if (ap->pflags & ATA_PFLAG_FROZEN) > -- > 2.40.1 >