On Wed, Aug 09, 2023 at 02:35:59PM -0400, Joel Fernandes wrote:
> On Wed, Aug 9, 2023 at 12:18 PM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> >
> > On 8/9/23 06:53, Joel Fernandes wrote:
> > > On Wed, Aug 09, 2023 at 12:40:36PM +0200, Greg Kroah-Hartman wrote:
> > >> This is the start of the stable review cycle for the 5.15.126 release.
> > >> There are 92 patches in this series, all will be posted as a response
> > >> to this one. If anyone has any issues with these being applied, please
> > >> let me know.
> > >>
> > >> Responses should be made by Fri, 11 Aug 2023 10:36:10 +0000.
> > >> Anything received after that time might be too late.
> > >>
> > >> The whole patch series can be found in one patch at:
> > >>     https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.126-rc1.gz
> > >> or in the git tree and branch at:
> > >>     git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > >> and the diffstat can be found below.
> > >
> > > Not necessarily new with 5.15 stable but 3 of the 19 rcutorture scenarios
> > > hang with this -rc: TREE04, TREE07, TASKS03.
> > >
> > > 5.15 has a known stop machine issue where it hangs after 1.5 hours with cpu
> > > hotplug rcutorture testing. Me and tglx are continuing to debug this. The
> > > issue does not show up on anything but 5.15 stable kernels and neither on
> > > mainline.
> > >
> >
> > Do you by any chance have a crash pattern that we could possibly use to find the crash
> > in ChromeOS crash logs ? No idea if that would help, but it could provide some
> > additional data points.
>
> The pattern shows as a hard hang, the system is unresponsive and all CPUs
> are stuck in stop_machine. Sometimes it recovers on its own from the
> hang and then RCU immediately gives stall warnings. It takes 1.5 hours
> to reproduce and sometimes never happens for several hours.
>
> It appears related to CPU hotplug since gdb showed me most of the CPUs
> are spinning in multi_cpu_stop() / stop machine after the hang.
>

Hmm, we do see lots of soft lockups with multi_cpu_stop() in the backtrace,
though with v5.4.y, not with v5.15.y. The actual hang is in stop_machine_yield().
Example:

<0>[63298.624328] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [migration/0:11]
<4>[63298.624331] Modules linked in: 8021q ccm snd_seq_dummy snd_seq snd_seq_device bridge stp llc tun nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp esp6 ah6 ip6t_REJECT ip6t_ipv6header vhost_vsock vhost vmw_vsock_virtio_transport_common vsock veth rfcomm xt_cgroup cmac algif_hash algif_skcipher af_alg xt_MASQUERADE uinput iwlmvm snd_soc_skl_ssp_clk iwl7000_mac80211 btusb snd_soc_kbl_da7219_max98357a btrtl btintel snd_soc_hdac_hdmi btbcm bluetooth snd_soc_dmic snd_soc_skl ecdh_generic ecc snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_hdac_hda uvcvideo snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core videobuf2_vmalloc videobuf2_v4l2 videobuf2_common snd_intel_dspcfg videobuf2_memops snd_hda_codec snd_hwdep snd_hda_core iwlwifi snd_soc_da7219 snd_soc_max98357a fuse ip6table_nat cfg80211 lzo_rle lzo_compress zram joydev
<4>[63298.624357] CPU: 0 PID: 11 Comm: migration/0 Tainted: G U W 5.4.180-17902-g44152654f29b #1
<4>[63298.624358] Hardware name: Google Nami/Nami, BIOS Google_Nami.10775.145.0 09/19/2019
<4>[63298.624363] RIP: 0010:stop_machine_yield+0xb/0xd
<4>[63298.624366] Code: ff 74 b6 f0 ff 0f 75 b1 48 83 c7 08 e8 1f cb f9 ff eb a6 e8 a0 20 e3 ff eb bc e8 50 4b f5 ff 0f 1f 44 00 00 55 48 89 e5 f3 90 <5d> c3 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 81
<4>[63298.624368] RSP: 0000:ffffbaf90006fe38 EFLAGS: 00000293 ORIG_RAX: ffffffffffffff13
<4>[63298.624370] RAX: 0000000000000000 RBX: ffffbaf90300bca8 RCX: 0000000000000000
<4>[63298.624371] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffffb0d46920
<4>[63298.624373] RBP: ffffbaf90006fe38 R08: 0000000000000002 R09: 0000398ecf9a0ac5
<4>[63298.624374] R10: 0000000000000171 R11: ffffffffaf9cfb11 R12: 0000000000000001
<4>[63298.624376] R13: ffff9b09baa22201 R14: ffffffffb0d46920 R15: 0000000000000001
<4>[63298.624377] FS: 0000000000000000(0000) GS:ffff9b09baa00000(0000) knlGS:0000000000000000
<4>[63298.624379] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[63298.624380] CR2: 0000153c00724820 CR3: 0000000171ab8005 CR4: 00000000003606f0
<4>[63298.624382] Call Trace:
<4>[63298.624386] multi_cpu_stop+0x89/0x119
<4>[63298.624389] ? stop_two_cpus+0x24d/0x24d
<4>[63298.624391] cpu_stopper_thread+0x8f/0x111
<4>[63298.624394] smpboot_thread_fn+0x174/0x212
<4>[63298.624397] kthread+0x147/0x156
<4>[63298.624399] ? cpu_report_death+0x43/0x43
<4>[63298.624401] ? kthread_blkcg+0x2e/0x2e
<4>[63298.624404] ret_from_fork+0x35/0x40
<0>[63298.624407] Kernel panic - not syncing: softlockup: hung tasks

I guess that is something different ?

Guenter
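
For anyone who wants to look for the same signature in their own crash logs, here is a minimal Python sketch. It assumes the logs are plain-text kernel console output with one message per line, in the same format as the example in this mail. The function name find_stop_machine_lockups and the window parameter are hypothetical, and the two symbols it looks for are only the ones named in this thread, so treat it as a starting point rather than a reliable classifier.

```python
import re

# Watchdog banner as it appears in the example log, e.g.
# "watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [migration/0:11]"
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)

# Symbols mentioned in this thread as part of the hang signature (assumption:
# no other symbols are needed to recognize the pattern).
STOP_MACHINE_SYMS = ("stop_machine_yield", "multi_cpu_stop")


def find_stop_machine_lockups(log_lines, window=40):
    """Return one dict per soft lockup whose next `window` lines
    mention at least one of the stop-machine symbols."""
    lines = list(log_lines)
    hits = []
    for i, line in enumerate(lines):
        match = LOCKUP_RE.search(line)
        if match is None:
            continue
        # Only inspect the lines right after the banner (registers, call trace).
        following = lines[i + 1 : i + 1 + window]
        symbols = [
            sym for sym in STOP_MACHINE_SYMS
            if any(sym in later for later in following)
        ]
        if symbols:
            hits.append({
                "cpu": int(match["cpu"]),
                "secs": int(match["secs"]),
                "comm": match["comm"],
                "pid": int(match["pid"]),
                "symbols": symbols,
            })
    return hits
```

It can be fed an open file object directly, for example find_stop_machine_lockups(open(path, errors="replace")), where path points at a saved console log. The window of 40 lines is an arbitrary guess at how far the register dump and call trace extend after the banner.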