On 22/01/24 11:22, Kishon Vijay Abraham I wrote: > +Siddharth > > Hi Diogo, > > On 1/12/2024 5:16 PM, Diogo Ivo wrote: >> >> On 1/12/24 07:57, Greg KH wrote: >>> On Thu, Jan 11, 2024 at 02:13:30PM +0000, Diogo Ivo wrote: >>>> Hello, >>>> >>>> When testing the IOT2050 Advanced M.2 platform with Linux CIP 6.1 >>>> we came across a breakage in the probing of the Keystone PCI driver >>>> (drivers/phy/ti/pci-keystone.c). This probing was working correctly >>>> in the previous version we were using, v5.10. >>>> >>>> In order to debug this we changed over to mainline Linux and bissecting >>>> lead us to find that commit e611f8cd8717 is the culprit, and with it applied >>>> we get the following messages: >>>> >>>> [ 10.954597] phy-am654 910000.serdes: Failed to enable PLL >>>> [ 10.960153] phy phy-910000.serdes.3: phy poweron failed --> -110 >>>> [ 10.967485] keystone-pcie 5500000.pcie: failed to enable phy >>>> [ 10.973560] keystone-pcie: probe of 5500000.pcie failed with error -110 >>>> >>>> This timeout is occuring in serdes_am654_enable_pll(), called from the >>>> phy_ops .power_on() hook. >>>> >>>> Due to the nature of the error messages and the contents of the commit we >>>> believe that this is due to an unidentified race condition in the probing of >>>> the Keystone PCI driver when enabling the PHY PLLs, since changes in the >>>> workqueue the deferred probing runs on should not affect if probing works >>>> or not. To further support the existence of a race condition, commit >>>> 86bfbb7ce4f6 (a scheduler commit) fixes probing, most likely unintentionally >>>> meaning that the problem may arise in the future again. >>>> >>>> One possible explanation is that there are pre-requisites for enabling the PLL >>>> that are not being met when e611f8cd8717 is applied; to see if this is the case >>>> help from people more familiar with the hardware details would be useful. >>>> >>>> As official support specifically for the IOT2050 Advanced M.2 platform was >>>> introduced in Linux v6.3 (so in the middle of the commits mentioned above) >>>> all of our testing was done with the latest mainline DeviceTree with [1] >>>> applied on top. >>>> >>>> This is being reported as a regression even though technically things are >>>> working with the current state of mainline since we believe the current fix >>>> to be an unintended by-product of other work. >>>> >>>> #regzbot introduced: e611f8cd8717 >>> A "regression" for a commit that was in 5.13, i.e. almost 2 years ago, >>> is a bit tough, and not something I would consider really a "regression" >>> as it is core code that everyone runs. Given you point at scheduler >>> changes also fixing the issue, this seems like a hint as to what is >>> wrong with your driver/platform, but is not the root cause of it and >>> needs to be resolved. Please look at fixing it in your drivers? Are >>> they all in Linus's tree? >>> >>> thanks, >>> >>> greg k-h >> Hello, >> >> I see the point that this code has been living in the kernel for a >> long time now and that it becomes more difficult to justify it as >> a regression; I reported it as such based on the supposition that >> the current fix is not the proper one and that technically this >> support was broken between the identified commits. >> >> If this situation is incompatible with a regression report then it >> can be dropped as one and we keep it is as a bug report for which >> we are looking for input from the community. >> >> I agree that this needs to be fixed in the driver since all other >> drivers are working fine with e611f8cd8717, and yes, all of the >> drivers in question are in mainline, where we performed the bissection. > > Looks like Siddharth from TI fixed a similar issue reported by you here. > https://lore.kernel.org/r/20230927041845.1222080-1-s-vadapalli@xxxxxx Kishon, Thank you for looping me in. Diogo, The issue you are referring to is identical to the one fixed in the patch shared above by Kishon. I had also bisected the culprit to commit e611f8cd8717 which doesn't have anything to do with PCIe/Serdes in particular. It only seems to expose the underlying race condition which has always existed. The fix has been merged and is now a part of mainline Linux: https://github.com/torvalds/linux/commit/c12ca110c613a81cb0f0099019c839d078cd0f38 Additionally, you might run into an issue once you fix the above which happens to be a 45 second delay when no Endpoint Device is connected to the PCIe connector. I have posted a patch for fixing that issue as well: https://lore.kernel.org/r/20231019081330.2975470-1-s-vadapalli@xxxxxx/ -- Regards, Siddharth.