Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/12/24 14:07, Boqun Feng wrote:
On Tue, Mar 12, 2024 at 01:32:03PM -0700, Florian Fainelli wrote:
Hi Boqun,

On 3/8/24 09:15, Boqun Feng wrote:
Hi Linus,

Please pull this for the RCU changes of v6.9:

The following changes since commit 41bccc98fb7931d63d03f326a746ac4d429c1dd3:

    Linux 6.8-rc2 (2024-01-28 17:01:12 -0800)

are available in the Git repository at:

    git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git tags/rcu.next.v6.9

for you to fetch changes up to 3add00be5fe5810d7aa5ec3af8b6a245ef33144b:

    Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', 'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a (2024-02-26 17:37:25 -0800)


Two merge conflicts were detected by linux-next:

* https://lore.kernel.org/lkml/20240226135745.12ac854d@xxxxxxxxxxxxxxxx/
* https://lore.kernel.org/lkml/20240227125522.2bdbe6be@xxxxxxxxxxxxxxxx/

These conflict resolutions from linux-next look good to me, plus I made
my own resolutions at branch merge/rcu.2024.02.27a for your reference.


Some highlights of the changes:

* Eliminates deadlocks involving do_exit() and RCU tasks, by Paul:
    Instead of SRCU read side critical sections, now a percpu list is used
    in do_exit() for scaning yet-to-exit tasks.

* Fixes a deadlock due to the dependency between workqueue and RCU
    expedited grace period, reported by Anna-Maria Behnsen and Thomas
    Gleixner and fixed by Frederic: Now RCU expedited always uses its own
    kthread worker instead of a workqueue.

At least one device in my test farm (ARM 32-bit) has consistently shown a
very long boot, and some others are intermittently affected. This
consistently looks like this on most of my devices:

[    2.450351] bcmgenet f0480000.ethernet: GENET 5.0 EPHY: 0x0000
[    2.547562] ata1: SATA link down (SStatus 0 SControl 300)
[  162.107264] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus

this gets flagged by my boot script as a boot failure since we exceeded the
30 seconds timeout given to boot a kernel to a prompt.

It has been somewhat difficult to get a reliable bisection going on, but
what I am sure of is that e5a3878c947ceef7b6ab68fdc093f3848059842c~1 does
not expose the problem for 10 consecutive boots, while I *might* see it at
e5a3878c947ceef7b6ab68fdc093f3848059842c and beyond.

Any clues what is going on here?


Could you share the config file and your kernel parameters? Also could
you share the whole log? Thanks!

Here is the configuration file:

https://gist.github.com/ffainelli/f3fd38752a186cee0cb8719089c0b6b8

and here is a log where this fails:

https://gist.github.com/ffainelli/ed08a2b3e853f59343786ebd20364fc8
--
Florian





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux