Re: [PATCH 1/2] PCI: Clear LBMS on resume to avoid Target Speed quirk

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 30 Jan 2024, Ilpo Järvinen wrote:

> On Tue, 30 Jan 2024, Maciej W. Rozycki wrote:
> 
> > On Tue, 30 Jan 2024, Ilpo Järvinen wrote:
> > 
> > > First of all, this is not true for pcie_failed_link_retrain():
> > >  * Return TRUE if the link has been successfully retrained, otherwise FALSE.
> > > If LBMS is not set, the Target Speed quirk is not applied but the function 
> > > still returns true. I think that should be changed to early return false
> > > when no LBMS is present.
> > 
> >  I think there is a corner case here indeed.  If Link Active reporting is 
> > supported and neither DLLLA nor LBMS are set at entry, then the function 
> > indeed returns success even though the link is down and no attempt to 
> > retrain will have been made.  So this does indeed look a case for a return 
> > with the FALSE result.
> > 
> >  I think most easily this can be sorted by delegating the return result to 
> > a temporary variable, preset to FALSE and then only updated from results 
> > of the calls to `pcie_retrain_link'.  I can offer a patch as the author of 
> > the code and one who has means to verify it right away if that helped.
> 
> I already wrote a patch which addresses this together with the caller 
> site changes.
> 
> >  Overall I guess it's all legacy of how this code evolved before it's been 
> > finally merged.
> 
> Indeed, it looks the step by step changed didn't yield good result here.
> 
> > > > >   3. pci_bridge_wait_for_secondary_bus() then calls pci_dev_wait() which
> > > > >      cannot succeed (but waits ~1 minute, delaying the resume).
> > > > > 
> > > > > The Target Speed trick (in step 2) is only used if LBMS bit (PCIe r6.1
> > > > > sec 7.5.3.8) is set. For links that have been operational before
> > > > > suspend, it is well possible that LBMS has been set at the bridge and
> > > > > remains on. Thus, after resume, LBMS does not indicate the link needs
> > > > > the Target Speed quirk. Clear LBMS on resume for bridges to avoid the
> > > > > issue.
> > 
> >  This can be problematic AFAICT however.  While I am not able to verify 
> > suspend/resume operation with my devices, I expect the behaviour to be 
> > exactly the same after resume as after a bus reset: the link will fail to 
> > negotiate and the LBMS and DLLLA bits will be respectively set and clear.  
> > Consequently if you clear LBMS at resume, then the workaround won't 
> > trigger and the link will remain inoperational in its limbo state.
> 
> How do you get the LBMS set in the first place? Isn't that because the 
> link tries to come up so shouldn't it reassert that bit again before the 
> code ends up into the target speed quirk? That is, I assumed you actually 
> wanted to detect LBMS getting set during pcie_wait_for_link_status() call 
> preceeding pcie_failed_link_retrain() call?
> 
> There's always an option to store it into pci_dev for later use when the 
> quirk is found to be working for the first time if other solutions don't 
> work.
> 
> In any case and unrelated to this patch, the way this quirk monopolizes 
> LBMS bit is going to have to be changed because it won't be reliable with 
> the PCIe BW controller that sets up and irq for LBMS (and clears the bit).
> In bwctrl v5 (yet to be posted) I'll add LBMS counter into bwctrl to allow 
> this quirk to keep working (which will need to be confirmed).
> 
> >  What kind of scenario does the LBMS bit get set in that you have a 
> > trouble with?  You write that in your case the downstream device has been 
> > disconnected while the bug hierarchy was suspended and it is not present 
> > anymore at resume, is that correct?
> >
> >  But in that case no link negotiation could have been possible and 
> > consequently the LBMS bit mustn't have been set by hardware (according to 
> > my understanding of PCIe), because (for compliant, non-broken devices 
> > anyway) it is only specified to be set for ports that can communicate with 
> > the other link end (the spec explicitly says there mustn't have been a 
> > transition through the DL_Down status for the port).
> >
> >  Am I missing something?
> 
> Yes, when resuming the device is already gone but the bridge still has 
> LBMS set. My understanding is that it was set because it was there
> from pre-suspend time but I've not really taken a deep look into it 
> because the problem and fix seemed obvious.
> 
> I read that "without the Port transitioning through DL_Down status" 
> differently than you, I only interpret that it relates to the two 
> bullets following it. ...So if retrain bit is set, and link then goes 
> down, the bullet no longer applies and LBMS should not be set because 
> there was transition through DL_Down. But I could well be wrong...

Hi again,

I went to check Root Ports on some machines and toggled their Link 
Disable bit on. None of the Root Ports I tested cleared LBMS. That means
LBMS being on after resume is not enough to differentiate the HW which 
needs Target Speed quirk and which does not. Unless we clear LBMS at some 
point which was what this patch tried to do.

So this misidentification problem in the quirk looks quite widespread but 
is mostly dormant because Links do come up normally so the quirk code is 
never called.

I might conduct even larger set of tests once I script a way to 
automatically pick a "safe" Root Port (something that is connected to a 
device but unused, I manually picked Root Ports above unused NICs for the 
manual tests I already did). But I don't believe it changes anything and 
in the end I won't find any Root Ports that would actually clear LBMS on 
their own when Link goes down.

So I would be really curious now to know how you get the LBMS on the 
device that needs the Target Speed quirk? Is this true (from the commit 
a89c82249c37 ("PCI: Work around PCIe link training failures")):

"Instead the link continues oscillating between the two speeds, at the 
rate of 34-35 times per second, with link training reported repeatedly 
active ~84% of the time."

?

Because if it is constantly picking another speed, it would mean you get 
LBMS set over and over again, no? If that happens 34-35 times per second, 
it should be set already again when we get into that quirk because there 
was some wait before it gets called.

-- 
 i.

[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux