Hi Rolf,

On 31.08.20 18:21, Rolf Eike Beer wrote:
> These things are in 5.8.4 AFAICT, and the lockups are still there:

Thanks for testing!

> [320616.602705] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [hppa2.0-unknown:29093]
> [320616.602705] Modules linked in: 8021q ipmi_poweroff ipmi_si ipmi_devintf sata_via ipmi_msghandler cbc dm_zero dm_snapshot dm_mirror dm_region_hash dm_log dm_crypt dm_bufio pata_sil680 libata
> [320616.602705] CPU: 0 PID: 29093 Comm: hppa2.0-unknown Not tainted 5.8.4-gentoo-parisc64 #1
> [320616.602705] Hardware name: 9000/785/C8000
>...
> [320616.602705] IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402706d0 00000000402706d4
> [320616.602705] IIR: 0ff0109c ISR: 000000005836f8a0 IOR: 0000000000000001
> [320616.602705] CPU: 0 CR30: 0000004083878000 CR31: ffffffffffffffff
> [320616.602705] ORIG_R28: 0000000000000801
> [320616.602705] IAOQ[0]: smp_call_function_many_cond+0x490/0x500
> [320616.602705] IAOQ[1]: smp_call_function_many_cond+0x494/0x500
> [320616.602705] RP(r2): smp_call_function_many_cond+0x468/0x500
> [320616.602705] Backtrace:
> [320616.602705] [<0000000040270824>] on_each_cpu+0x5c/0x98
> [320616.602705] [<0000000040186a0c>] flush_tlb_all+0x204/0x228
> [320616.602705] [<00000000402ef1f8>] tlb_finish_mmu+0x1d8/0x210
> [320616.602705] [<00000000402eb820>] exit_mmap+0x1d8/0x370
> [320616.602705] [<00000000401b5ec0>] mmput+0xe8/0x260
> [320616.602705] [<00000000401c1690>] do_exit+0x558/0x12e8
> [320616.602705] [<00000000401c3f18>] do_group_exit+0x50/0x118
> [320616.602705] [<00000000401c4000>] sys_exit_group+0x20/0x28
> [320616.602705] [<0000000040192018>] syscall_exit+0x0/0x14

I agree. I have seen the same stall too.
I think we should try to analyze how the stall in smp_call_function_many_cond() can happen. The trace always seems to point to do_exit().

I think those patches from Linus helped with the "old kind of stalls" which we have had over the last months/years:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c6fe44d96fc1536af5b11cd859686453d1b7bfd1
and
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a9127fcf2296674d58024f83981f40b128fffea

Those old stalls looked something like this and didn't point to do_exit():

[111395.307021] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[111395.311001] rcu: 3-...0: (1 GPs behind) idle=04e/1/0x4000000000000000 softirq=13650053/13650054 fqs=2625
[111395.311001] (detected by 0, t=5252 jiffies, g=25258025, q=1240)
[111395.311001] Task dump for CPU 3:
[111395.311001] init R running task 0 1 0 0x00000016
[111395.311001] Backtrace:
[111395.311001] [<0000000040416110>] hrtimer_try_to_cancel+0x13c/0x1f8
[111395.311001] [<0000000040e65a18>] schedule_hrtimeout_range_clock+0x10c/0x1b8
[111395.311001] [<0000000040e65af4>] schedule_hrtimeout_range+0x30/0x60
[111395.311001] [<0000000040e5ebbc>] _cond_resched+0x40/0xb8
[111395.311001] [<00000000403617a4>] get_signal+0x348/0xf00
[111395.311001] [<000000004031cec0>] do_signal+0x54/0x230
[111395.311001] [<000000004031d0f8>] do_notify_resume+0x5c/0x164
[111395.311001] [<0000000040310110>] syscall_do_signal+0x54/0xa0
[111395.311001] [<000000004030f074>] intr_return+0x0/0xc
[111395.311001]
[111458.330562] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[111458.330562] rcu: 3-...0: (1 GPs behind) idle=04e/1/0x4000000000000000 softirq=13650053/13650054 fqs=4653
[111458.330562] (detected by 0, t=21007 jiffies, g=25258025, q=1361)
[111458.330562] Task dump for CPU 3:
[111458.330562] init R running task 0 1 0 0x00000016
...

Helge