Re: [REGRESSION] Keystone PCI driver probing and SerDes PLL timeout

"Linux regression tracking (Thorsten Leemhuis)" <regressions@xxxxxxxxxxxxx> · Fri, 12 Jan 2024 13:51:02 +0100

On 12.01.24 12:46, Diogo Ivo wrote:
> On 1/12/24 07:57, Greg KH wrote:
>> On Thu, Jan 11, 2024 at 02:13:30PM +0000, Diogo Ivo wrote:
>>>
>>> When testing the IOT2050 Advanced M.2 platform with Linux CIP 6.1
>>> we came across a breakage in the probing of the Keystone PCI driver
>>> (drivers/phy/ti/pci-keystone.c). This probing was working correctly
>>> in the previous version we were using, v5.10.
>>>
>>> In order to debug this we changed over to mainline Linux and bissecting
>>> lead us to find that commit e611f8cd8717 is the culprit, and with it
>>> applied
>>> we get the following messages:
>>>
>>> [   10.954597] phy-am654 910000.serdes: Failed to enable PLL
>>> [   10.960153] phy phy-910000.serdes.3: phy poweron failed --> -110
>>> [   10.967485] keystone-pcie 5500000.pcie: failed to enable phy
>>> [   10.973560] keystone-pcie: probe of 5500000.pcie failed with error
>>> -110
>>>
>>> This timeout is occuring in serdes_am654_enable_pll(), called from the
>>> phy_ops .power_on() hook.
>>>
>>> Due to the nature of the error messages and the contents of the
>>> commit we
>>> believe that this is due to an unidentified race condition in the
>>> probing of
>>> the Keystone PCI driver when enabling the PHY PLLs, since changes in the
>>> workqueue the deferred probing runs on should not affect if probing
>>> works
>>> or not. To further support the existence of a race condition, commit
>>> 86bfbb7ce4f6 (a scheduler commit) fixes probing, most likely
>>> unintentionally
>>> meaning that the problem may arise in the future again.
>>>
>>> One possible explanation is that there are pre-requisites for
>>> enabling the PLL
>>> that are not being met when e611f8cd8717 is applied; to see if this
>>> is the case
>>> help from people more familiar with the hardware details would be
>>> useful.
>>>
>>> As official support specifically for the IOT2050 Advanced M.2
>>> platform was
>>> introduced in Linux v6.3 (so in the middle of the commits mentioned
>>> above)
>>> all of our testing was done with the latest mainline DeviceTree with [1]
>>> applied on top.
>>>
>>> This is being reported as a regression even though technically things
>>> are
>>> working with the current state of mainline since we believe the
>>> current fix
>>> to be an unintended by-product of other work.
>>>
>>> #regzbot introduced: e611f8cd8717
>> A "regression" for a commit that was in 5.13, i.e. almost 2 years ago,
>> is a bit tough, and not something I would consider really a "regression"
>> as it is core code that everyone runs.

In case anyone cares: I'd call it a regression, but not one where the
"no regressions" rule applies. Primarily because the regression does not
 happen anymore in mainline (even if that was by chance). Furthermore I
think Linus once said that regressions become just bugs when they are
reported only after quite some time. Not sure if that is the case here,
but whatever, due to the previous point discussing this is moot. That's
why I'll drop it from the tracking:

#regzbot inconclusive: fixed in mainline (maybe accidentally), but not
in stable

> I see the point that this code has been living in the kernel for a
> long time now and that it becomes more difficult to justify it as
> a regression; I reported it as such based on the supposition that
> the current fix is not the proper one and that technically this
> support was broken between the identified commits.

True, but it apparently also doesn't happen anymore in mainline. To
technically nobody is afaics obliged to look into this afaics.

Of course it still would be nice to resolve this in the affected
longterm series. And hey, with a bit of luck this thread might lead to a
fix one way or another. I hope that will be the case!

> [...]

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.