Re: [PATCH AUTOSEL 4.14 6/6] ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()

Niklas Cassel <Niklas.Cassel@xxxxxxx> · Thu, 28 Sep 2023 16:00:24 +0000

Hello Sasha,

The upstream commit that you are backporting here:
80cc944eca4f ("ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()")

Requires that we have:
737dd811a3db ("ata: libahci: clear pending interrupt status")
in stable. This is currently not the case.

See my comment in:
https://lore.kernel.org/stable/ZRWf7Avtdv3DeqCU@x1-carbon/T/#t

Perhaps both:
93c7711494f4 ("ata: ahci: Drop pointless VPRINTK() calls and convert the remaining ones")
and
737dd811a3db ("ata: libahci: clear pending interrupt status")

Could be backported to stable.
That way,
80cc944eca4f ("ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()")
should be safe to backport to stable as well.

Kind regards,
Niklas

On Sun, Sep 24, 2023 at 09:20:49AM -0400, Sasha Levin wrote:
> From: Niklas Cassel <niklas.cassel@xxxxxxx>
> 
> [ Upstream commit 80cc944eca4f0baa9c381d0706f3160e491437f2 ]
> 
> ata_scsi_port_error_handler() starts off by clearing ATA_PFLAG_EH_PENDING,
> before calling ap->ops->error_handler() (without holding the ap->lock).
> 
> If an error IRQ is received while ap->ops->error_handler() is running,
> the irq handler will set ATA_PFLAG_EH_PENDING.
> 
> Once ap->ops->error_handler() returns, ata_scsi_port_error_handler()
> checks if ATA_PFLAG_EH_PENDING is set, and if it is, another iteration
> of ATA EH is performed.
> 
> The problem is that ATA_PFLAG_EH_PENDING is not only cleared by
> ata_scsi_port_error_handler(), it is also cleared by ata_eh_reset().
> 
> ata_eh_reset() is called by ap->ops->error_handler(). This additional
> clearing done by ata_eh_reset() breaks the whole retry logic in
> ata_scsi_port_error_handler(). Thus, if an error IRQ is received while
> ap->ops->error_handler() is running, the port will currently remain
> frozen and will never get re-enabled.
> 
> The additional clearing in ata_eh_reset() was introduced in commit
> 1e641060c4b5 ("libata: clear eh_info on reset completion").
> 
> Looking at the original error report:
> https://marc.info/?l=linux-ide&m=124765325828495&w=2
> 
> We can see the following happening:
> [    1.074659] ata3: XXX port freeze
> [    1.074700] ata3: XXX hardresetting link, stopping engine
> [    1.074746] ata3: XXX flipping SControl
> 
> [    1.411471] ata3: XXX irq_stat=400040 CONN|PHY
> [    1.411475] ata3: XXX port freeze
> 
> [    1.420049] ata3: XXX starting engine
> [    1.420096] ata3: XXX rc=0, class=1
> [    1.420142] ata3: XXX clearing IRQs for thawing
> [    1.420188] ata3: XXX port thawed
> [    1.420234] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 
> We are not supposed to be able to receive an error IRQ while the port is
> frozen (PxIE is set to 0, i.e. all IRQs for the port are disabled).
> 
> AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) states:
> "Each bit location can be thought of as reporting a '1' if the virtual
> "interrupt line" for that port is indicating it wishes to generate an
> interrupt. That is, if a port has one or more interrupt status bit set,
> and the enables for those status bits are set, then this bit shall be set."
> 
> Additionally, AHCI state P:ComInit clearly shows that the state machine
> will only jump to P:ComInitSetIS (which sets IS.IPS(x) to '1'), if PxIE.PCE
> is set to '1'. In our case, PxIE is set to 0, so IS.IPS(x) won't get set.
> 
> So IS.IPS(x) only gets set if PxIS and PxIE is set.
> 
> AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) also states:
> "The bits in this register are read/write clear. It is set by the level of
> the virtual interrupt line being a set, and cleared by a write of '1' from
> the software."
> 
> So if IS.IPS(x) is set, you need to explicitly clear it by writing a 1 to
> IS.IPS(x) for that port.
> 
> Since PxIE is cleared, the only way to get an interrupt while the port is
> frozen, is if IS.IPS(x) is set, and the only way IS.IPS(x) can be set when
> the port is frozen, is if it was set before the port was frozen.
> 
> However, since commit 737dd811a3db ("ata: libahci: clear pending interrupt
> status"), we clear both PxIS and IS.IPS(x) after freezing the port, but
> before the COMRESET, so the problem that commit 1e641060c4b5 ("libata:
> clear eh_info on reset completion") fixed can no longer happen.
> 
> Thus, revert commit 1e641060c4b5 ("libata: clear eh_info on reset
> completion"), so that the retry logic in ata_scsi_port_error_handler()
> works once again. (The retry logic is still needed, since we can still
> get an error IRQ _after_ the port has been thawed, but before
> ata_scsi_port_error_handler() takes the ap->lock in order to check
> if ATA_PFLAG_EH_PENDING is set.)
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
> Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> ---
>  drivers/ata/libata-eh.c | 13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index cbe9af624a06f..8a789de056807 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -2948,18 +2948,11 @@ int ata_eh_reset(struct ata_link *link, int classify,
>  			postreset(slave, classes);
>  	}
>  
> -	/*
> -	 * Some controllers can't be frozen very well and may set spurious
> -	 * error conditions during reset.  Clear accumulated error
> -	 * information and re-thaw the port if frozen.  As reset is the
> -	 * final recovery action and we cross check link onlineness against
> -	 * device classification later, no hotplug event is lost by this.
> -	 */
> +	/* clear cached SError */
>  	spin_lock_irqsave(link->ap->lock, flags);
> -	memset(&link->eh_info, 0, sizeof(link->eh_info));
> +	link->eh_info.serror = 0;
>  	if (slave)
> -		memset(&slave->eh_info, 0, sizeof(link->eh_info));
> -	ap->pflags &= ~ATA_PFLAG_EH_PENDING;
> +		slave->eh_info.serror = 0;
>  	spin_unlock_irqrestore(link->ap->lock, flags);
>  
>  	if (ap->pflags & ATA_PFLAG_FROZEN)
> -- 
> 2.40.1
>