Re: INFO: rcu detected stall in ext4_write_checks

Dmitry Vyukov <dvyukov@xxxxxxxxxx> · Fri, 5 Jul 2019 15:18:06 +0200

On Wed, Jun 26, 2019 at 8:43 PM Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Wed, Jun 26, 2019 at 10:27:08AM -0700, syzbot wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:    abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1435aaf6a00000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=e5c77f8090a3b96b
> > dashboard link: https://syzkaller.appspot.com/bug?extid=4bfbbf28a2e50ab07368
> > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11234c41a00000
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15d7f026a00000
> >
> > The bug was bisected to:
> >
> > commit 0c81ea5db25986fb2a704105db454a790c59709c
> > Author: Elad Raz <eladr@xxxxxxxxxxxx>
> > Date:   Fri Oct 28 19:35:58 2016 +0000
> >
> >     mlxsw: core: Add port type (Eth/IB) set API
>
> Um, so this doesn't pass the laugh test.
>
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10393a89a00000
>
> It looks like the automated bisection machinery got confused by two
> failures getting triggered by the same repro; the symptoms changed
> over time.  Initially, the failure was:
>
> crashed: INFO: rcu detected stall in {sys_sendfile64,ext4_file_write_iter}
>
> Later, the failure changed to something completely different, and much
> earlier (before the test was even started):
>
> run #5: basic kernel testing failed: failed to copy test binary to VM: failed to run ["scp" "-P" "22" "-F" "/dev/null" "-o" "UserKnownHostsFile=/dev/null" "-o" "BatchMode=yes" "-o" "IdentitiesOnly=yes" "-o" "StrictHostKeyChecking=no" "-o" "ConnectTimeout=10" "-i" "/syzkaller/jobs/linux/workdir/image/key" "/tmp/syz-executor216456474" "root@10.128.15.205:./syz-executor216456474"]: exit status 1
> Connection timed out during banner exchange
> lost connection
>
> Looks like an opportunity to improve the bisection engine?

Hi Ted,

Yes, these infrastructure errors plague bisections episodically.
That's https://github.com/google/syzkaller/issues/1250

It did not confuse bisection directly, as it understands that these
are infrastructure failures rather than kernel crashes. E.g. here you
may see that it correctly identified that this run was OK and started
bisection in the v4.9..v4.10 range despite 2 scp failures:

testing release v4.9
testing commit 69973b830859bc6529a7a0468ba0d80ee5117826 with gcc (GCC) 5.5.0
run #0: basic kernel testing failed: failed to copy test binary to VM:
failed to run ["scp" ...]: exit status 1
Connection timed out during banner exchange
run #1: basic kernel testing failed: failed to copy test binary to VM:
failed to run ["scp" ....]: exit status 1
Connection timed out during banner exchange
run #2: OK
run #3: OK
run #4: OK
run #5: OK
run #6: OK
run #7: OK
run #8: OK
run #9: OK
# git bisect start v4.10 v4.9

Though, of course, it may confuse bisection indirectly by reducing
the number of tests per commit.
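
To illustrate the distinction, here is a minimal sketch of how per-run
results could be sorted into OK, infrastructure failure and crash
before a verdict is chosen for a commit. This is not syzkaller's
actual code: the function names, the "run #N: crashed: ..." line
format and the min_valid threshold are assumptions made for the
example.

```python
import re

# Matches per-run result lines such as "run #2: OK".
RUN_RE = re.compile(r"^run #(\d+): (.*)$")


def classify(line):
    """Return 'ok', 'infra', 'crash', or None for lines that are not run results."""
    m = RUN_RE.match(line)
    if m is None:
        return None  # continuation lines, e.g. the scp command or its stderr
    result = m.group(2)
    if result == "OK":
        return "ok"
    if result.startswith("basic kernel testing failed"):
        return "infra"  # the VM never ran the test, so this says nothing about the kernel
    if result.startswith("crashed:"):  # assumed format for crash lines
        return "crash"
    return None


def verdict(log_lines, min_valid=1):
    """Pick a verdict for one commit; min_valid is a hypothetical threshold."""
    counts = {"ok": 0, "infra": 0, "crash": 0}
    for line in log_lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    if counts["crash"] > 0:
        return "bad", counts
    if counts["ok"] >= min_valid:
        return "good", counts
    # Only infrastructure failures: no evidence either way.
    return "skip", counts


# The v4.9 runs quoted above, with the scp command abbreviated as in the mail.
SAMPLE = [
    "run #0: basic kernel testing failed: failed to copy test binary to VM:",
    'failed to run ["scp" ...]: exit status 1',
    "Connection timed out during banner exchange",
    "run #1: basic kernel testing failed: failed to copy test binary to VM:",
    'failed to run ["scp" ....]: exit status 1',
    "Connection timed out during banner exchange",
] + ["run #%d: OK" % i for i in range(2, 10)]

if __name__ == "__main__":
    print(verdict(SAMPLE))
```

With the v4.9 runs above, the two scp failures are counted separately
and the commit is still judged on the eight valid runs. The indirect
effect is visible in the counts: every infrastructure failure is one
fewer chance to hit a flaky crash.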

So far I haven't been able to gather any significant info about these
failures. We collect console logs, but on these runs they are empty.
It's easy to blame everything on GCE, but I don't have any
information that would point either way. These failures just appear
randomly in production, usually in batches...