Re: [PATCH v2 2/2] PCI: reread the Link Control 2 Register before using

"Maciej W. Rozycki" <macro@xxxxxxxxxxx> · Thu, 16 Jan 2025 15:00:28 +0000 (GMT)

On Thu, 16 Jan 2025, Ilpo Järvinen wrote:

> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 76f4df75b08a..02d2e16672a8 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -123,6 +123,7 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
> >  			return ret;
> >  		}
> >  
> > +		pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> >  		pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> >  	}
> 
> I started to wonder if there would be a better way to fix this. If I've 
> understood Maciej's intent correctly, there are two cases when the 2nd 
> step (the one lifting the 2.5GT/s restriction) should be used:
> 
> 1) TLS is 2.5GT/s at the entry to the quirk and it's an ASMedia switch.

 Correct.

> 2) If the quirk lowered TLS to 2.5GT/s and the link became up fine 
> because of that. This also currently checks for ASMedia but in the v1 you 
> also proposed to change that. We know it works in some cases with ASMedia 
> but are unsure what happens with other switches.

 Correct.

> In the case 2, we don't need to check TLS since the function itself asked 
> TLS to be lowered which did not return an error, so we know the speed was 
> lowered so why spend time on rereading the register? A simple local 
> variable could convey the same information.

 Agreed.

> In case 1, I don't think ASMedia check should be removed. It was about a 
> case where FW has a workaround to lower the speed (IIRC). If Link Speed is 
> 2.5GT/s at entry but it's not ASMedia switch, there's no 2.5GT/s 
> restriction to lift.

 Your recollection is correct.  Perhaps we ought to lift the restriction 
instead based on the ID of the downstream device though.

 NB referring our earlier discussion the system in question is now at:

# uptime
 15:31:25 up 160 days,  2:58,  1 user,  load average: 0.00, 0.00, 0.00
# 

and the questionable link has recorded no LBMS events except for one on 
the day it was booted or maybe a day after (which I cleared back then):

# setpci -s 02:03.0 CAP_EXP+0x12.W
3012
# 

There has been some network data traffic across the link, not excessively 
high, but not insignificant either:

        RX packets 181328169  bytes 2178598128 (2.0 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 267194647  bytes 1149978069 (1.0 GiB)
        TX errors 0  dropped 2 overruns 0  carrier 0  collisions 0

So I think we can safely conclude signalling is free from disruption at 
the electrical level and all the mess with link training at speeds beyond 
2.5GT/s is owing to some kind of a protocol interpretation issue at the 
data link layer with either or both devices, and the downstream device 
being the higher suspect based on other issues with Pericom/Diodes gear.  

 Would you agree with my conclusion?

 NB I continue being busy with other stuff, but please feel free to ping 
me directly if you need any input from me.

  Maciej