Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/13/24 15:49, Russell King (Oracle) wrote:
On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
On 3/13/24 14:59, Russell King (Oracle) wrote:
On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
I will try to provide multiple answers for the sake of everyone having the
same context. Responding to Linus' specifically and his suggestion to use
"initcall_debug", this is what it gave me:

[    6.970669] ata1: SATA link down (SStatus 0 SControl 300)
[  166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
[  166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
[  166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
[  166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
[  166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller

Also got another occurrence happening resuming from suspend to DRAM with:

[   22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
[  181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs

and also with the PCIe root complex driver:

[   18.266279] brcm-pcie f0460000.pcie: PM: calling
brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
[  177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
[  177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
[  177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
returned 0 after 159190939 usecs
[  177.457257] pcieport 0000:00:00.0: PM: calling
pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00

Surprisingly those drivers are consistently reproducing the failures I am
seeing so at least this gave me a clue as to where the problem is.

There were no changes to drivers/net/ethernet/broadcom/genet/, the two
changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
the read_poll_timeout() conversion is correct, we properly break out of the
loop. The initial delay looked like a good culprit for a little while, but
it is not used on the affected platforms because instead we provide a
callback and we have an interrupt to signal the completion of a MDIO
operation, therefore unimac_mdio_poll() is not used at all. Finally
drivers/memory/brcmstb_dpfe.c also received a single change which is not
functional here (.remove function change do return void).

I went back to a manual bisection and this time I believe that I have a more
plausible candidate with:

7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
hierarchical pull model")

I haven't understood the code there yet, and how it would interact with
arch code, but one thing that immediately jumps out to me is this:

"    As long as a CPU is busy it expires both local and global timers. When a
      CPU goes idle it arms for the first expiring local timer."

So are local timers "armed" when they are enqueued while the cpu is
"busy" during initialisation, and will they expire, and will that
expiry be delivered in a timely manner?

If not, this commit is basically broken, and would be the cause of the
issue you are seeing. For the mdio case, we're talking about 2ms
polling. For the dpfe case, it looks like we're talking about 1ms
sleeps. I'm guessing that these end up being local timers.

Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
for link up every 5ms - if the link was down and we msleep(5) I wonder
if that's triggering the same issue.

Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
say. I would imagine that the same hardware timer driver is being used
(may be worth checking DT.) The same should be true for the interrupt
driver as well. There's been no changes in that code.

I just had it happen with ARM64 I was plagued by:

https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/

and my earlier bisections somehow did not have ARM64 fail, so I thought it
was immune but it fails with about the same failure rate as ARM 32-bit.

Okay, so if that's two architectures failing, it's probably not an
architecture specific issue. It could still be a timer/interrupt driver
issue though. It would be nice to have a reproducer on x86 which would
confirm it.

Do you know the frequency that the failure happens?

On this specific commit 7ee988770326fca440472200c3eb58935fe712f6, there is a 100% failure for at least 3 devices out of the 16 that are running the test.
--
Florian





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux