On Thu, Jan 11, 2024 at 02:13:30PM +0000, Diogo Ivo wrote: > Hello, > > When testing the IOT2050 Advanced M.2 platform with Linux CIP 6.1 > we came across a breakage in the probing of the Keystone PCI driver > (drivers/phy/ti/pci-keystone.c). This probing was working correctly > in the previous version we were using, v5.10. > > In order to debug this we changed over to mainline Linux and bissecting > lead us to find that commit e611f8cd8717 is the culprit, and with it applied > we get the following messages: > > [ 10.954597] phy-am654 910000.serdes: Failed to enable PLL > [ 10.960153] phy phy-910000.serdes.3: phy poweron failed --> -110 > [ 10.967485] keystone-pcie 5500000.pcie: failed to enable phy > [ 10.973560] keystone-pcie: probe of 5500000.pcie failed with error -110 > > This timeout is occuring in serdes_am654_enable_pll(), called from the > phy_ops .power_on() hook. > > Due to the nature of the error messages and the contents of the commit we > believe that this is due to an unidentified race condition in the probing of > the Keystone PCI driver when enabling the PHY PLLs, since changes in the > workqueue the deferred probing runs on should not affect if probing works > or not. To further support the existence of a race condition, commit > 86bfbb7ce4f6 (a scheduler commit) fixes probing, most likely unintentionally > meaning that the problem may arise in the future again. > > One possible explanation is that there are pre-requisites for enabling the PLL > that are not being met when e611f8cd8717 is applied; to see if this is the case > help from people more familiar with the hardware details would be useful. > > As official support specifically for the IOT2050 Advanced M.2 platform was > introduced in Linux v6.3 (so in the middle of the commits mentioned above) > all of our testing was done with the latest mainline DeviceTree with [1] > applied on top. > > This is being reported as a regression even though technically things are > working with the current state of mainline since we believe the current fix > to be an unintended by-product of other work. > > #regzbot introduced: e611f8cd8717 A "regression" for a commit that was in 5.13, i.e. almost 2 years ago, is a bit tough, and not something I would consider really a "regression" as it is core code that everyone runs. Given you point at scheduler changes also fixing the issue, this seems like a hint as to what is wrong with your driver/platform, but is not the root cause of it and needs to be resolved. Please look at fixing it in your drivers? Are they all in Linus's tree? thanks, greg k-h