On 3/14/24 03:41, Frederic Weisbecker wrote:
On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
On 3/13/24 14:59, Russell King (Oracle) wrote:
On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
I will try to provide multiple answers for the sake of everyone having the
same context. Responding to Linus' specifically and his suggestion to use
"initcall_debug", this is what it gave me:
[ 6.970669] ata1: SATA link down (SStatus 0 SControl 300)
[ 166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
[ 166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
[ 166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
[ 166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
[ 166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
Also got another occurrence happening resuming from suspend to DRAM with:
[ 22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
[ 181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
and also with the PCIe root complex driver:
[ 18.266279] brcm-pcie f0460000.pcie: PM: calling
brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
[ 177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
[ 177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
[ 177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
returned 0 after 159190939 usecs
[ 177.457257] pcieport 0000:00:00.0: PM: calling
pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
Surprisingly those drivers are consistently reproducing the failures I am
seeing so at least this gave me a clue as to where the problem is.
There were no changes to drivers/net/ethernet/broadcom/genet/, the two
changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
the read_poll_timeout() conversion is correct, we properly break out of the
loop. The initial delay looked like a good culprit for a little while, but
it is not used on the affected platforms because instead we provide a
callback and we have an interrupt to signal the completion of a MDIO
operation, therefore unimac_mdio_poll() is not used at all. Finally
drivers/memory/brcmstb_dpfe.c also received a single change which is not
functional here (.remove function change do return void).
I went back to a manual bisection and this time I believe that I have a more
plausible candidate with:
7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
hierarchical pull model")
I haven't understood the code there yet, and how it would interact with
arch code, but one thing that immediately jumps out to me is this:
" As long as a CPU is busy it expires both local and global timers. When a
CPU goes idle it arms for the first expiring local timer."
So are local timers "armed" when they are enqueued while the cpu is
"busy" during initialisation, and will they expire, and will that
expiry be delivered in a timely manner?
If not, this commit is basically broken, and would be the cause of the
issue you are seeing. For the mdio case, we're talking about 2ms
polling. For the dpfe case, it looks like we're talking about 1ms
sleeps. I'm guessing that these end up being local timers.
Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
for link up every 5ms - if the link was down and we msleep(5) I wonder
if that's triggering the same issue.
Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
say. I would imagine that the same hardware timer driver is being used
(may be worth checking DT.) The same should be true for the interrupt
driver as well. There's been no changes in that code.
I just had it happen with ARM64 I was plagued by:
https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
and my earlier bisections somehow did not have ARM64 fail, so I thought it
was immune but it fails with about the same failure rate as ARM 32-bit.
Can you please boot with:
trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
And add the following and give us the resulting output in dmesg?
Here are two logs from two different systems that exposed the problem on
boot:
https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0
Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
events. My fault, can you retry with this syntax?
trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
Though it's fairly possible that timer migration is not enabled at this point
as it's a late initcall. But we better not miss its traces otherwise.
Here is another log with timer_migration:
https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
--
Florian