Re: SRCU hung task on 5.10.y on synchronize_srcu(&fsnotify_mark_srcu)

On Wed, Sep 04, 2024 at 02:40:07PM +0000, Jon Kohler wrote:
> 
> 
> > On Sep 4, 2024, at 5:19 AM, Jan Kara <jack@xxxxxxx> wrote:
> > 
> > On Tue 27-08-24 20:01:27, Jon Kohler wrote:
> >> Hey Paul, Lai, Josh, and the RCU list and Jan/FS list -
> >> Reaching out about a tricky hung task issue that I'm running into. I've
> >> got a virtualized Linux guest on top of a KVM based platform, running
> >> a 5.10.y based kernel. The issue we're running into is a hung task that
> >> *only* happens on shutdown/reboot of this particular VM, roughly once
> >> every 20-50 reboots.
> >> 
> >> The signature of the hung task is always similar to the output below,
> >> where we appear to hang on the call to
> >>    synchronize_srcu(&fsnotify_mark_srcu)
> >> in fsnotify_connector_destroy_workfn / fsnotify_mark_destroy_workfn.
> >> Two kernel threads are both calling synchronize_srcu, scheduling out
> >> in wait_for_completion, and then going completely out to lunch for
> >> over 4 minutes. This then triggers the hung task timeout and things
> >> blow up.
> > 
> > Well, the most obvious reason for this would be that some process is
> > hanging somewhere with fsnotify_mark_srcu held. When this happens, can you
> > trigger sysrq-w in the VM and send its output here?
> 
> Jan - Thanks for the ping, that is *exactly* what is happening here.
> Some developments since my last note: the patch Neeraj pointed out
> wasn't the issue; rather, it was a confluence of realtime thread
> configurations that ended up completely starving whichever CPU was
> processing the per-CPU callbacks. One realtime thread would go out to
> lunch completely and just never yield. This particular system was
> unfortunately configured with RT_RUNTIME_SHARE, so that one thread
> going out to lunch ate the entire system.
> 
> What was odd was that this never, ever happened during runtime on some
> of these systems that have been up for years and getting beat up heavily,
> but rather only on shutdown. We’ve got more to chase down internally on
> that.
> 
> One thing I did want to bring up while I have you: I have noticed,
> through various hits on Google, mailing lists, etc. over the years,
> that this specific type of lockup with fsnotify_mark_srcu seems to
> happen now and then, for various oddball reasons and with various
> root causes.
> 
> It made me wonder whether there is a better structure that could be
> used here, one that might be a bit more durable. To be clear, I'm not
> saying that SRCU *is not* durable or anything of the sort (I promise!),
> but rather wondering whether there is anything we could think about
> tweaking on the fsnotify side of the house to be more efficient.
> 
> Thoughts?

For RCU in real-time environments, we have RCU priority boosting, which
boosts RCU readers that have been preempted for too long.  However, this
is SRCU, in which readers can simply block, in addition to being
preempted.  Of course, boosting the priority of a task that has blocked
(as opposed to being preempted) cannot help -- the task will remain
blocked until awakened, regardless of its priority.
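
To make that concrete, here is a minimal sketch (not fsnotify code; the
srcu_struct and completion names below are made up for illustration) of
an SRCU reader that blocks rather than merely being preempted.  Any
synchronize_srcu() on the same srcu_struct must wait at least until this
reader is awakened, and no amount of priority boosting changes that:

	#include <linux/srcu.h>
	#include <linux/completion.h>

	/* Hypothetical example objects, not part of fsnotify. */
	DEFINE_STATIC_SRCU(example_srcu);
	static DECLARE_COMPLETION(example_event);

	static void blocked_srcu_reader(void)
	{
		int idx;

		idx = srcu_read_lock(&example_srcu);
		/*
		 * Sleeps until some other context does
		 * complete(&example_event).  While this task sleeps,
		 * synchronize_srcu(&example_srcu) cannot return, and
		 * boosting this task's priority does not wake it up.
		 */
		wait_for_completion(&example_event);
		srcu_read_unlock(&example_srcu, idx);
	}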

But your case takes this one step further, in that the workqueue invoking
callbacks is being preempted and starved, correct?

The usual advice is to make sure that your housekeeping CPUs get
sufficient CPU time.  Easy to say, easy to do, harder to keep done
uniformly across a large number of systems running diverse workloads.
Still, this is the preferred approach.
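
For example (illustrative numbers only; the exact flags depend on kernel
version and workload), on an 8-CPU guest where CPUs 0-1 are reserved for
housekeeping, something along the lines of:

	isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7

on the kernel command line, with the RT threads affined to CPUs 2-7,
leaves CPUs 0-1 for kthreads, unbound workqueues, and offloaded RCU
callback processing.  The other half of the job is making sure that no
SCHED_FIFO task can monopolize (or, with RT_RUNTIME_SHARE, borrow all of
the runtime of) those housekeeping CPUs.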

Just out of curiosity, is this a CONFIG_PREEMPT_RT kernel?

							Thanx, Paul

> >> We are running audit=1 for this system and are using an el8 based
> >> userspace.
> >> 
> >> I've flipped through the fs/notify code base for both 5.10 as well as
> >> upstream mainline to see if something jumped off the page, and I
> >> haven't yet spotted any particular suspect code from the caller side.
> >> 
> >> This hang appears to come up at the very end of the shutdown/reboot
> >> process, seemingly after the system starts to unwind through initrd.
> >> 
> >> What I'm working on now is adding some instrumentation to the dracut
> >> shutdown initrd scripts to see how far we get down that path
> >> before the system fails to make forward progress, which may give some
> >> hints. TBD on that. I've also enabled lockdep with CONFIG_PROVE_RCU and
> >> a plethora of DEBUG options [2], and didn't get anything interesting.
> >> To be clear, we haven't seen lockdep spit out any complaints as of yet.
> > 
> > The fact that lockdep doesn't report anything is interesting but then
> > lockdep doesn't track everything. In particular I think SRCU itself isn't
> > tracked by lockdep.
> > 
> > Honza
> > -- 
> > Jan Kara <jack@xxxxxxxx>
> > SUSE Labs, CR
> 
> 



