Re: [syzbot] INFO: task hung in jbd2_journal_commit_transaction (3)

Dmitry Vyukov <dvyukov@xxxxxxxxxx> · Fri, 20 May 2022 13:57:07 +0200

On Wed, 22 Dec 2021 at 05:35, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Wed, Dec 22, 2021 at 10:25:27AM +0800, Hillf Danton wrote:
> > > I'm not sure what you hope to learn by doing something like that.
> > > That will certainly perturb the system, but every 150 seconds, the
> > > task is going to let other tasks/threads run --- but it will be
> > > whatever is the next highest priority thread.
> >
> > Without a reproducer, I am trying to reproduce the issue using a FIFO CPU hog
> > that is supposed to beat the watchdog and show me the victims, such as various
> > kthreads, workqueue workers and user apps, even though I know nothing about how
> > the watchdog is configured, only that the report came down to a watchdog bite.
>
> It's really trivial to reproduce an issue that has the same symptom as
> what has been reported to you.  Mount the file system using a
> non-real-time (SCHED_OTHER) thread, such that the jbd2 and ext4 worker
> threads are running SCHED_OTHER.  Then run some file system workload
> (fsstress or fsmark) as SCHED_FIFO.  Then on an N CPU system, run N
> processes as SCHED_FIFO at any priority (it doesn't matter whether it's
> MAX_PRI-1 or MIN_PRI); SCHED_FIFO will have priority over SCHED_OTHER
> processes, so this will effectively starve the ext4 and jbd2 worker
> threads from ever getting to run.  Once the ext4 journal fills up, any
> SCHED_FIFO process which tries to write to the file system will hang.
>
> The problem is that's *one* potential stupid configuration of the
> real-time system.  It's not necessarily the *only* potentially stupid
> way that you can get yourself into a system hang.  It appears the
> syzkaller "repro" is another such "stupid way".  And the number of
> ways you can screw up with a real-time system is practically
> unbounded...
>
> So getting back to syzkaller, Willy had the right approach, which is: if a
> syzkaller "repro" happens to use SCHED_FIFO or SCHED_RR, and the
> symptom is a system hang, it's probably worth ignoring the report,
> since it's going to be a waste of time to debug a userspace bug.  If you
> have anything that uses kernel threads, and SCHED_FIFO or SCHED_RR is
> in play, it's probably a userspace bug.
>
> Cheers,

Hi Ted,

Reviving this old thread re syzkaller using SCHED_FIFO.

It's a bit hard to restrict what the fuzzer can do if we give it
sched_setattr() and the related syscalls. We could remove them from the
fuzzer entirely, but that's probably suboptimal as well.

I see that setting up SCHED_FIFO is guarded by CAP_SYS_NICE:
https://elixir.bootlin.com/linux/v5.18-rc7/source/kernel/sched/core.c#L7264

And I see we have been dropping CAP_SYS_NICE from the fuzzer process
since 2019 (after a similar discussion):
https://github.com/google/syzkaller/commit/f3ad68446455a

The latest C reproducer contains:

static void drop_caps(void)
{
  struct __user_cap_header_struct cap_hdr = {};
  struct __user_cap_data_struct cap_data[2] = {};
  cap_hdr.version = _LINUX_CAPABILITY_VERSION_3;
  cap_hdr.pid = getpid();
  if (syscall(SYS_capget, &cap_hdr, &cap_data))
    exit(1);
  const int drop = (1 << CAP_SYS_PTRACE) | (1 << CAP_SYS_NICE);
  cap_data[0].effective &= ~drop;
  cap_data[0].permitted &= ~drop;
  cap_data[0].inheritable &= ~drop;
  if (syscall(SYS_capset, &cap_hdr, &cap_data))
    exit(1);
}

Are we holding it wrong? How can the process manage to set any bad
scheduling policies if it has dropped CAP_SYS_NICE?
The process still has CAP_SYS_ADMIN, but I assume that should not let
it do anything that requires the dropped CAP_SYS_NICE.
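
One way to check this directly might be a small probe called right after
drop_caps(). The sketch below is only an illustration of such a check, not
something taken from the reproducer, and the helper names are made up. It
reports whether CAP_SYS_NICE is still in the effective set, whether a
SCHED_FIFO request is accepted, and the soft RLIMIT_RTPRIO, which sched(7)
documents as a ceiling below which unprivileged threads may still use
real-time policies.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/capability.h>

/* Hypothetical helper: 1 if CAP_SYS_NICE is effective, 0 if not, -1 on error. */
int has_cap_sys_nice(void)
{
  struct __user_cap_header_struct hdr;
  struct __user_cap_data_struct data[2];
  memset(&hdr, 0, sizeof(hdr));
  memset(data, 0, sizeof(data));
  hdr.version = _LINUX_CAPABILITY_VERSION_3;
  hdr.pid = 0; /* 0 means the calling thread */
  if (syscall(SYS_capget, &hdr, data))
    return -1;
  /* CAP_SYS_NICE is below 32, so it lives in the first data word. */
  return (data[0].effective >> CAP_SYS_NICE) & 1;
}

/* Hypothetical helper: request SCHED_FIFO at the lowest priority.
 * Returns 0 if the kernel accepted it, otherwise the errno value.
 * On success the thread is switched back to SCHED_OTHER. */
int try_sched_fifo(void)
{
  struct sched_param sp;
  memset(&sp, 0, sizeof(sp));
  sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
  if (sched_setscheduler(0, SCHED_FIFO, &sp))
    return errno;
  sp.sched_priority = 0;
  sched_setscheduler(0, SCHED_OTHER, &sp);
  return 0;
}

/* Hypothetical helper: soft RLIMIT_RTPRIO, or -1 on error.
 * A nonzero value lets a thread without CAP_SYS_NICE use
 * real-time priorities up to that limit. */
long rtprio_soft_limit(void)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_RTPRIO, &rl))
    return -1;
  return (long)rl.rlim_cur;
}
```

If has_cap_sys_nice() returns 0 after drop_caps() and try_sched_fifo()
still returns 0, the permission is coming from somewhere other than the
capability, and the value of rtprio_soft_limit() would be the first thing
to look at.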