Re: [syzbot] INFO: task hung in jbd2_journal_commit_transaction (3)

"Theodore Ts'o" <tytso@xxxxxxx> · Tue, 21 Dec 2021 23:35:41 -0500

On Wed, Dec 22, 2021 at 10:25:27AM +0800, Hillf Danton wrote:
> > I'm not sure what you hope to learn by doing something like that.
> > That will certainly perturb the system, but every 150 seconds, the
> > task is going to let other tasks/threads run --- but it will be
> > whatever is the next highest priority thread. 
> 
> Without reproducer, I am trying to reproduce the issue using a FIFO CPU hog
> which is supposed to beat the watchdog to show me the victims like various
> kthreads, workqueue workers and user apps, despite I know zero about how the
> watchdog is configured except the report was down to watchdog bite.

It's really trivial to reproduce an issue that has the same symptom as
what has been reported to you.  Mount the file system using a
non-real-time (SCHED_OTHER) thread, such that the jbd2 and ext4 worker
threads are running SCHED_OTHER.  Then run some file system workload
(fsstress or fsmark) as SCHED_FIFO.  Then on an N CPU system, run N
processes as SCHED_FIFO at any priority.  (It doesn't matter whether
it's MAX_PRI-1 or MIN_PRI; SCHED_FIFO will have priority over
SCHED_OTHER processes.)  This will effectively starve the ext4 and
jbd2 worker threads, so they never get to run.  Once the ext4 journal
fills up, any SCHED_FIFO process which tries to write to the file
system will hang.
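
For illustration only, here is a rough sketch of the "N SCHED_FIFO
hogs" part of that recipe, written in Python on top of the standard
os.sched_* wrappers.  The helper names (fifo_priority_range,
policy_name, spawn_fifo_hogs) are made up for this example and are not
part of any existing tool.  It assumes Linux and enough privilege
(CAP_SYS_NICE, typically root) to switch to SCHED_FIFO.  Whether it
actually produces a hang depends on how real-time scheduling is
configured on the machine, so treat it as a sketch rather than a
guaranteed reproducer, and only call spawn_fifo_hogs() on a scratch
system.

```python
import os


def fifo_priority_range():
    """Return the (min, max) static priorities valid for SCHED_FIFO."""
    return (os.sched_get_priority_min(os.SCHED_FIFO),
            os.sched_get_priority_max(os.SCHED_FIFO))


def policy_name(pid=0):
    """Return a readable name for the scheduling policy of `pid` (0 = self)."""
    names = {
        os.SCHED_OTHER: "SCHED_OTHER",
        os.SCHED_FIFO: "SCHED_FIFO",
        os.SCHED_RR: "SCHED_RR",
    }
    policy = os.sched_getscheduler(pid)
    return names.get(policy, "policy %d" % policy)


def spawn_fifo_hogs(count, priority):
    """Fork `count` children that switch to SCHED_FIFO and spin forever.

    Hypothetical helper: it returns the child PIDs so the caller can
    kill them afterwards.  A child that cannot switch policy (for
    example, missing CAP_SYS_NICE) exits immediately with status 1.
    """
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            try:
                os.sched_setscheduler(0, os.SCHED_FIFO,
                                      os.sched_param(priority))
            except OSError:
                os._exit(1)
            while True:      # busy loop: never yields the CPU voluntarily
                pass
        pids.append(pid)
    return pids
```

Something like spawn_fifo_hogs(os.cpu_count(), fifo_priority_range()[0])
would start one hog per CPU at the lowest SCHED_FIFO priority, which is
still above every SCHED_OTHER thread, including the jbd2 and ext4
workers if the file system was mounted from a SCHED_OTHER context.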

The problem is that's *one* potential stupid configuration of the
real-time system.  It's not necessarily the *only* potentially stupid
way that you can get yourself into a system hang.  It appears the
syzkaller "repro" is another such "stupid way".  And the number of
ways you can screw up with a real-time system is practically
unbounded...

So getting back to syzkaller, Willy had the right approach, which is:
if a syzkaller "repro" happens to use SCHED_FIFO or SCHED_RR, and the
symptom is a system hang, it's probably worth ignoring the report,
since it's going to be a waste of time to debug a userspace bug.  If you
have anything that uses kernel threads, and SCHED_FIFO or SCHED_RR is
in play, it's probably a userspace bug.

Cheers,

					- Ted