Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)

Jon Kohler <jon@xxxxxxxxxxx> · Wed, 28 Aug 2024 02:41:09 +0000

> On Aug 27, 2024, at 9:21 PM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> 
> On Tue, Aug 27, 2024 at 11:33:32PM +0000, Jon Kohler wrote:
>> 
>> 
>>> On Aug 27, 2024, at 4:34 PM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>> 
>>> On Tue, Aug 27, 2024 at 08:01:27PM +0000, Jon Kohler wrote:
>>>> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
>>>> Reaching out about a tricky hung task issue that I'm running into. I've
>>>> got a virtualized Linux guest on top of a KVM based platform, running
>>>> a 5.10.y based kernel. The issue we're running into is a hung task that
>>>> *only* happens on shutdown/reboot of this particular VM once every 
>>>> 20-50 times.
>>>> 
>>>> The signature of the hung task is always similar to the output below,
>>>> where we appear to hang on the call to 
>>>>   synchronize_srcu(&fsnotify_mark_srcu)
>>> 
>>> One thing to try would be to add trace_printk() or similar to the SRCU
>>> readers, just in case someone was using srcu_read_lock_notrace() on
>>> fsnotify_mark_srcu, which I see no trace of in current mainline.
>>> 
>>> Alternatively, if there is a version where this does not happen, try
>>> bisecting.  Each bisection step would require something like 400-500
>>> shutdown/reboots to prove the commit good.  (Obviously, the first failure
>>> proves the commit bad, which for one-out-of-50 failures will take on
>>> average about 35 shutdown/reboots.)
>>> 
>>> There could also be a bad SRCU backport from mainline, so please check
>>> what SRCU backports you have in your 5.10.y stable release.  (Though
>>> maybe Jack has already done this?)
>>> 
>>> Thanx, Paul
>> 
>> Thanks, Paul
>> 
>> For posterity, this kernel is just built off of the regular ole stable tree, and here’s
>> All of the backports to kernel/rcu are below.
>> 
>> Stepping through this more, since we’re stalling at wait_for_completion,
>> that must mean that wakeme_after_rcu() ... complete(&rcu->completion)
>> is not happening, right?
>> 
>> That or somehow wakeme_after_rcu() fires *before* wait_for_completion()
>> is setup, causing the wait to hang forever?
> 
> There is no problem with that misordering -- in that case, the call to
> wait_for_completion() simply won't wait.
> 
> But that would be a good path to trace.  After all, you might have
> noticed that what we believe software will do does not always match what
> it actually does.

Agreed, I’ll noodle on it some more, and try to bisect as well. I see 
Neeraj responded separately, I’ll check that out now.
> 
> 
>> Is it possible for wakeme_after_rcu() to race, such that it fires somewhere
>> else *before* wait_for_completion gets all the way to schedule()?
> 
> That should also work just fine, but it is still a good assumption to
> check.
> 
> Good list below, but I must focus on mainline.  Apologies!
> 
> Plus this assumes that v5.10 worked for you -- has that been tested?
> So again, what version has worked for you?

We’ve had this particular service on 5.10.y for a while; however, this
behavior just recently started bubbling up with a somewhat new(er) set
of internal QA tests that very specifically stress shutdown/reboots in a
loop for a completely unrelated durability test, so its possible this issue
has existed for a while.

> 
> Thanx, Paul
> 
>> [rcu]$ pwd
>> /kernel/kernel/rcu
>> [rcu]$ git remote -v
>> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e=  (fetch)
>> origin https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_stable_linux.git&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=1Tj05_-3u5bekAXIOn3TZBxlsLepxSWdhZVB6OKs02CYfxx4O9XkBJaihu2H2SaF&s=i-EJON2mPUDM-dvijeXkVr4nsR1g_obFaq9F9DCfQNs&e=  (push)
>> [rcu]$ git log --oneline 2c85ebc57b3e..HEAD .
>> ca4427ebc626 (HEAD, tag: v5.10.205) Linux 5.10.205 <<<< this is the base commit for 5.10.205 >>>>
>> ...
>> 175f4b062f69 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects <<<< this is the most recent backport commit to kernel/rcu >>>>
>> 55887adc76e1 rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()
>> 066fbd8bc981 refscale: Fix uninitalized use of wait_queue_head_t
>> d93ba6e46e5f rcu-tasks: Add trc_inspect_reader() checks for exiting critical section
>> 3e22624f8fd3 rcu-tasks: Wait for trc_read_check_handler() IPIs
>> 9190c1f0aed1 rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
>> ad4f8c117b8b rcu: Prevent expedited GP from enabling tick on offline CPU
>> 4f91de9a81bd rcu-tasks: Simplify trc_read_check_handler() atomic operations
>> 3a64cd01cdd6 rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
>> 058f077d09ba rcu-tasks: Mark ->trc_reader_nesting data races
>> 604d6a5ff718 rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
>> d414e24d1509 rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
>> ecc5e6dbc269 rcuscale: Move shutdown from wait_event() to wait_event_idle()
>> b62c816bdb5e rcuscale: Always log error message
>> 8cd9917c13a7 rcuscale: Console output claims too few grace periods
>> 7230a9e599d3 rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
>> a7d21b858589 rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
>> e4842de4ec13 refscale: Move shutdown from wait_event() to wait_event_idle()
>> eb18bc5a8678 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
>> d99d194e2f8c rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
>> 2bf501f1bc78 rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()
>> 1c37e86a78c2 rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()
>> ad410f64f7ab rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls
>> b02b6bb83c68 rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose
>> 7c15d7ecce00 rcu: Prevent lockdep-RCU splats on lock acquisition/release
>> 5a52380b8193 rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
>> 0dd025483f15 rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE()
>> 36d4ffbedff7 rcu: Back off upon fill_page_cache_func() allocation failure
>> 10f30cba8f6c rcu: Make TASKS_RUDE_RCU select IRQ_WORK
>> 1c6c3f233664 rcu-tasks: Fix race in schedule and flush work
>> a22d66eb518f rcu: Apply callbacks processing time limit only on softirq
>> 40fb3812d997 rcu: Fix callbacks processing time limit retaining cond_resched()
>> fcc9797d0d13 rcu: Don't deboost before reporting expedited quiescent state
>> 0c145262ac99 rcu/nocb: Fix missed nocb_timer requeue
>> 657991fb06a4 rcu: Do not report strict GPs for outgoing CPUs
>> 12d3389b7af6 rcu: Tighten rcu_advance_cbs_nowake() checks
>> 0836f9404017 rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
>> 70692b06208c rcu: Mark accesses to rcu_state.n_force_qs
>> af756be29c82 rcu: Always inline rcu_dynticks_task*_{enter,exit}()
>> 226d68fb6c0a rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
>> 02ddf26d849d rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
>> 7f43cda650d5 rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
>> d3ca78775db4 rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
>> 497f3d9c3f58 rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
>> ea5e5bc881a4 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
>> 527b56d7856f rcu: Fix to include first blocked task in stall warning
>> 4b680b3fc6f3 rcu/tree: Handle VM stoppage in stall detection
>> b6ae3854075e srcu: Provide polling interfaces for Tiny SRCU grace periods
>> 450948b06ce8 srcu: Make Tiny SRCU use multi-bit grace-period counter
>> 641e1d88404a srcu: Provide internal interface to start a Tiny SRCU grace period
>> f789de3be808 srcu: Provide polling interfaces for Tree SRCU grace periods
>> fdf66e5a7fc8 srcu: Provide internal interface to start a Tree SRCU grace period
>> 86cb49e7314e rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
>> 55ddab2bfd70 rcu-tasks: Don't delete holdouts within trc_inspect_reader()
>> 35a35909ec19 rcu: Reject RCU_LOCKDEP_WARN() false positives
>> 23597afbe096 srcu: Fix broken node geometry after early ssp init
>> 728f23e53c65 rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
>> 7d81aff28953 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
>> 09a27d662006 kvfree_rcu: Use same set of GFP flags as does single-argument
>> e713bdd791ba rcu/nocb: Perform deferred wake up before last idle's need_resched() check
>> 20b7669fa3f0 rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
>> 30b491e2b6cc rcu-tasks: Move RCU-tasks initialization to before early_initcall()
>> 9b81af9c8455 rcu/tree: Defer kvfree_rcu() allocation to a clean context
>> 5cacd18c5207 rcu,ftrace: Fix ftrace recursion
>> 4540e84bd8a9 rcu: Allow rcu_irq_enter_check_tick() from NMI <<<< this is the first kernel/rcu backport for 5.10.y branch >>>>
>> ...
>> 2c85ebc57b3e (tag: v5.10) Linux 5.10 <<<< this is the divergence point from mainline to 5.10.y >>>> 
>> 
>>> 
>>>> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn,
>>>> where two kernel threads are both calling synchronize_srcu, then
>>>> scheduling out in wait_for_completion, and completely going out to
>>>> lunch for over 4 minutes. This then triggers the hung task timeout and
>>>> things blow up.
>>>> 
>>>> We are running audit=1 for this system and are using an el8 based
>>>> userspace.
>>>> 
>>>> I've flipped through the fs/notify code base for both 5.10 as well as
>>>> upstream mainline to see if something jumped off the page, and I
>>>> haven't yet spotted any particular suspect code from the caller side.
>>>> 
>>>> This hang appears to come up at the very end of the shutdown/reboot
>>>> process, seemingly after the system starts to unwind through initrd.
>>>> 
>>>> What I'm working on now is adding some instrumentation to the dracut
>>>> shutdown initrd scripts to see if I can how far we get down that path
>>>> before the system fails to make forward progress, which may give some
>>>> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
>>>> a plethora of DEBUG options [2], and didn't get anything interesting.
>>>> To be clear, we haven't seen lockdep spit out any complaints as of yet.
>>>> 
>>>> Reaching out to see if this sounds familar to anyone on the list, or if
>>>> there are any particular areas of the RCU code base that might be
>>>> suspect for this kind of issue. I'm happy to provide more information,
>>>> as frankly, I'm quite stumped at the moment.
>>>> 
>>>> Thanks all,
>>>> Jon
>>>> 
>>>> [1] panic trace
>>>>   Normal shutdown process, then hangs on the following:
>>>>   ...
>>>>   dracut Warning: Killing all remaining processes
>>>>   ...
>>>>   INFO: task kworker/u20:7:1200701 blocked for more than 241 seconds.
>>>>         Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>   task:kworker/u20:7   state:D stack:    0 pid:1200701 ppid:     2 flags:0x00004080
>>>>   Workqueue: events_unbound fsnotify_connector_destroy_workfn
>>>>   Call Trace:
>>>>    __schedule+0x267/0x790
>>>>    schedule+0x3c/0xb0
>>>>    schedule_timeout+0x219/0x2b0
>>>>    wait_for_completion+0x9e/0x100
>>>>    __synchronize_srcu.part.24+0x83/0xb0
>>>>    ? __bpf_trace_rcu_utilization+0x10/0x10
>>>>    ? synchronize_srcu+0x5d/0xf0
>>>>    fsnotify_connector_destroy_workfn+0x46/0x80
>>>>    process_one_work+0x1fc/0x390
>>>>    worker_thread+0x2d/0x3e0
>>>>    ? process_one_work+0x390/0x390
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>>   INFO: task kworker/u20:8:1287360 blocked for more than 241 seconds.
>>>>         Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>   task:kworker/u20:8   state:D stack:    0 pid:1287360 ppid:     2 flags:0x00004080
>>>>   Workqueue: events_unbound fsnotify_mark_destroy_workfn
>>>>   Call Trace:
>>>>    __schedule+0x267/0x790
>>>>    schedule+0x3c/0xb0
>>>>    schedule_timeout+0x219/0x2b0
>>>>    ? add_timer+0x14a/0x200
>>>>    wait_for_completion+0x9e/0x100
>>>>    __synchronize_srcu.part.24+0x83/0xb0
>>>>    ? __bpf_trace_rcu_utilization+0x10/0x10
>>>>    fsnotify_mark_destroy_workfn+0x77/0xe0
>>>>    process_one_work+0x1fc/0x390
>>>>    ? process_one_work+0x390/0x390
>>>>    worker_thread+0x2d/0x3e0
>>>>    ? process_one_work+0x390/0x390
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>>   Kernel panic - not syncing: hung_task: blocked tasks
>>>>   CPU: 1 PID: 64 Comm: khungtaskd Kdump: loaded Tainted: G           O      5.10.205-2.el8.x86_64 #1
>>>>   Hardware name: Red Hat KVM, BIOS 20230302.1.2662.el8 04/01/2014
>>>>   Call Trace:
>>>>    dump_stack+0x6d/0x8c
>>>>    panic+0x114/0x2ea
>>>>    watchdog.cold.8+0xb5/0xb5
>>>>    ? hungtask_pm_notify+0x50/0x50
>>>>    kthread+0x114/0x130
>>>>    ? kthread_park+0x80/0x80
>>>>    ret_from_fork+0x1f/0x30
>>>> 
>>>> [2] additional debugging config knobs turned up.
>>>>   CONFIG_PROVE_LOCKING=y
>>>>   CONFIG_LOCK_STAT=y
>>>>   CONFIG_DEBUG_RT_MUTEXES=y
>>>>   CONFIG_DEBUG_SPINLOCK=y
>>>>   CONFIG_DEBUG_MUTEXES=y
>>>>   CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
>>>>   CONFIG_DEBUG_RWSEMS=y
>>>>   CONFIG_DEBUG_LOCK_ALLOC=y
>>>>   CONFIG_LOCKDEP=y
>>>>   CONFIG_LOCKDEP_BITS=15
>>>>   CONFIG_LOCKDEP_CHAINS_BITS=16
>>>>   CONFIG_LOCKDEP_STACK_TRACE_BITS=19
>>>>   CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
>>>>   CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
>>>>   CONFIG_DEBUG_SHIRQ=y
>>>>   CONFIG_WQ_WATCHDOG=y
>>>>   CONFIG_DEBUG_ATOMIC_SLEEP=y
>>>>   CONFIG_DEBUG_LIST=y
>>>>   CONFIG_DEBUG_PLIST=y
>>>>   CONFIG_DEBUG_SG=y
>>>>   CONFIG_DEBUG_NOTIFIERS=y
>>>>   CONFIG_BUG_ON_DATA_CORRUPTION=y