Re: [PATCH 1/2] PCI: Clear LBMS on resume to avoid Target Speed quirk

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 1 Feb 2024, Maciej W. Rozycki wrote:

> On Thu, 1 Feb 2024, Ilpo Järvinen wrote:

> > > >  This can be problematic AFAICT however.  While I am not able to verify 
> > > > suspend/resume operation with my devices, I expect the behaviour to be 
> > > > exactly the same after resume as after a bus reset: the link will fail to 
> > > > negotiate and the LBMS and DLLLA bits will be respectively set and clear.  
> > > > Consequently if you clear LBMS at resume, then the workaround won't 
> > > > trigger and the link will remain inoperational in its limbo state.
> > > 
> > > How do you get the LBMS set in the first place? Isn't that because the 
> > > link tries to come up so shouldn't it reassert that bit again before the 
> > > code ends up into the target speed quirk? That is, I assumed you actually 
> > > wanted to detect LBMS getting set during pcie_wait_for_link_status() call 
> > > preceeding pcie_failed_link_retrain() call?
> 
>  It is a good question what the sequence of events exactly is that sets 
> the LBMS bit.  I don't know the answer offhand.
>
> > > In any case and unrelated to this patch, the way this quirk monopolizes 
> > > LBMS bit is going to have to be changed because it won't be reliable with 
> > > the PCIe BW controller that sets up and irq for LBMS (and clears the bit).
> > > In bwctrl v5 (yet to be posted) I'll add LBMS counter into bwctrl to allow 
> > > this quirk to keep working (which will need to be confirmed).
> 
>  If there's an interrupt handler for LBMS events, then it may be the best 
> approach if the quirk is triggered by the handler instead, possibly as a 
> softirq.

Okay, I'll look into changing the code towards that direction. The small 
trouble there is that soon I've very little that can be configured away 
from the bandwidth controller because this quirk already relates to the 
target speed changing code and the LBMS will require the other side to be 
always compiled too...

> > > >  What kind of scenario does the LBMS bit get set in that you have a 
> > > > trouble with?  You write that in your case the downstream device has been 
> > > > disconnected while the bug hierarchy was suspended and it is not present 
> > > > anymore at resume, is that correct?
> > > >
> > > >  But in that case no link negotiation could have been possible and 
> > > > consequently the LBMS bit mustn't have been set by hardware (according to 
> > > > my understanding of PCIe), because (for compliant, non-broken devices 
> > > > anyway) it is only specified to be set for ports that can communicate with 
> > > > the other link end (the spec explicitly says there mustn't have been a 
> > > > transition through the DL_Down status for the port).
> > > >
> > > >  Am I missing something?
> > > 
> > > Yes, when resuming the device is already gone but the bridge still has 
> > > LBMS set. My understanding is that it was set because it was there
> > > from pre-suspend time but I've not really taken a deep look into it 
> > > because the problem and fix seemed obvious.
> 
>  I've always been confused with the suspend/resume terminology: I'd have 
> assumed this would have gone through a power cycle, in which case the LBMS 
> bit would have necessarily been cleared in the transition, because its 
> required state at power-up/reset is 0, so the value of 1 observed would be 
> a result of what has happened solely through the resume stage.  Otherwise 
> it may make sense to clear the bit in the course of the suspend stage, 
> though it wouldn't be race-free I'm afraid.

I also thought suspend as one possibility but yes, it racy. Mika also 
suggested clearing LBMS after each successful retrain but that wouldn't 
cover all possible ways to get LBMS set as devices can set it 
independently of OS. Keeping it cleared constantly is pretty much what 
will happen with the bandwidth controller anyway.

> > > I read that "without the Port transitioning through DL_Down status" 
> > > differently than you, I only interpret that it relates to the two 
> > > bullets following it. ...So if retrain bit is set, and link then goes 
> > > down, the bullet no longer applies and LBMS should not be set because 
> > > there was transition through DL_Down. But I could well be wrong...
> 
> What I refer to is that if you suspend your system, remove the device 
> that originally caused the quirk to trigger and then resume your system 
> with the device absent,

A small correction here, the quirk didn't trigger initially for the 
device, it does that only after resume. And even then quirk is called 
only because the link doesn't come up.

On longer term, I'd actually want to have hotplug resumed earlier so the 
disconnect could be detected before/while all this waiting related to link 
up is done. But that's very complicated to realize in practice because 
hotplug lurks behind portdrv so resuming it earlier isn't going to be 
about just moving a few lines around.

> then LBMS couldn't have been set in the course of 
> resume, because the port couldn't have come out of the DL_Down status in 
> the absence of the downstream device.  Do you interpret it differently?

That's a good question and I don't have an answer to this yet, that is,
I don't fully understand what happens to those device during runtime 
suspend/resume cycle and what is the exact mechanism that preserves the 
LBMS bit, I'll look more into it.

But I agree that if the device goes cold enough and downstream is 
disconnected, the port should no longer have a way to reassert LBMS.

> Of course once set the bit isn't self-clearing except at power-up/reset.

Okay, I misunderstood you meant it would be self-clearing whenever 
DL_Down happens. I can see that could have been one possible 
interpretation of the text fragment in the spec.

> > So I would be really curious now to know how you get the LBMS on the 
> > device that needs the Target Speed quirk? Is this true (from the commit 
> > a89c82249c37 ("PCI: Work around PCIe link training failures")):
> > 
> > "Instead the link continues oscillating between the two speeds, at the 
> > rate of 34-35 times per second, with link training reported repeatedly 
> > active ~84% of the time."
> > 
> > ?
> 
>  That is what I have observed.  It was so long ago I don't remember how I 
> calculated the figures anymore, it may have been with a U-Boot debug patch 
> made to collect samples (because with U-Boot you can just poll the LT bit 
> while busy-looping).  I'd have to try and dig out the old stuff.

I'd guess it probably sets the bit on each try, or perhaps only on the 
subset of tries which were "successful" before the link almost immediately 
runs into another error (that 16% of the time).

> > Because if it is constantly picking another speed, it would mean you get 
> > LBMS set over and over again, no? If that happens 34-35 times per second, 
> > it should be set already again when we get into that quirk because there 
> > was some wait before it gets called.
> 
>  I'll see if I can experiment with the hardware over the next couple of 
> days and come back with my findings.

Okay thanks.


-- 
 i.

[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux