Re: Patch 'writeback, cgroup: release dying cgwbs by switching attached inodes' leads to kernel crash

Hi Jan,

On Wed, Jul 14, 2021 at 5:26 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Wed 14-07-21 16:44:33, Boyang Xue wrote:
> > Hi Roman,
> >
> > On Wed, Jul 14, 2021 at 12:12 PM Roman Gushchin <guro@xxxxxx> wrote:
> > >
> > > On Wed, Jul 14, 2021 at 11:21:12AM +0800, Boyang Xue wrote:
> > > > Hello,
> > > >
> > > > I'm not sure if this is the right place to report this bug, please
> > > > correct me if I'm wrong.
> > > >
> > > > I found kernel-5.14.0-rc1 (built from the Linus tree) crash when it's
> > > > running xfstests generic/256 on ext4 [1]. Looking at the call trace,
> > > > it looks like the bug had been introduced by the commit
> > > >
> > > > c22d70a162d3 writeback, cgroup: release dying cgwbs by switching attached inodes
> > > >
> > > > It only happens on aarch64, not on x86_64, ppc64le and s390x. Testing
> > > > was performed with the latest xfstests, and the bug can be reproduced
> > > > on ext{2, 3, 4} with {1k, 2k, 4k} block sizes.
> > >
> > > Hello Boyang,
> > >
> > > thank you for the report!
> > >
> > > Do you know on which line the oops happens?
> >
> > I was trying to inspect the vmcore with crash utility, but
> > unfortunately it doesn't work.
>
> Thanks for report!  Have you tried addr2line utility? Looking at the oops I
> can see:

Thanks for the tips!

It's unclear to me where to find the address that the addr2line
command line requires, i.e.

addr2line -e /usr/lib/debug/lib/modules/5.14.0-0.rc1.15.bx.el9.aarch64/vmlinux
<what address here?>

But I did try gdb, like this:

# gdb /usr/lib/debug/lib/modules/5.14.0-0.rc1.15.bx.el9.aarch64/vmlinux
GNU gdb (GDB) Red Hat Enterprise Linux 10.1-14.el9
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/debug/lib/modules/5.14.0-0.rc1.15.bx.el9.aarch64/vmlinux...
(gdb) list *(cleanup_offline_cgwbs_workfn+0x320)
0xffff8000102d6ddc is in cleanup_offline_cgwbs_workfn (./arch/arm64/include/asm/jump_label.h:38).
33      }
34
35      static __always_inline bool arch_static_branch_jump(struct static_key *key,
36                                                          bool branch)
37      {
38              asm_volatile_goto(
39                      "1:     b               %l[l_yes]               \n\t"
40                       "      .pushsection    __jump_table, \"aw\"    \n\t"
41                       "      .align          3                       \n\t"
42                       "      .long           1b - ., %l[l_yes] - .   \n\t"
(gdb)

I'm not sure whether this is meaningful.
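
For what it's worth, the address addr2line wants is presumably just the
symbol's start address plus the offset from the oops "pc :" line. The
sketch below is only an illustration of that arithmetic. The helper
names (parse_pc_line, absolute_address) are made up for this example,
and the symbol base is an assumption: it is back-derived from the
0xffff8000102d6ddc that gdb printed in the session above, whereas
normally it would come from something like nm on the same vmlinux.

```python
import re

# Matches the "pc : symbol+0xOFFSET/0xSIZE" form seen in the oops.
PC_RE = re.compile(
    r"pc\s*:\s*(?P<sym>[A-Za-z_][\w.]*)"
    r"\+(?P<off>0x[0-9a-fA-F]+)/(?P<size>0x[0-9a-fA-F]+)"
)


def parse_pc_line(line):
    """Return (symbol, offset, function_size) from an oops 'pc :' line."""
    m = PC_RE.search(line)
    if m is None:
        raise ValueError("no 'pc : sym+0xoff/0xsize' pattern found")
    return m.group("sym"), int(m.group("off"), 16), int(m.group("size"), 16)


def absolute_address(symbol_base, offset):
    """Address of the faulting instruction: symbol start plus offset."""
    return symbol_base + offset


# The line quoted from the oops in this thread.
sym, off, size = parse_pc_line(
    "[ 4371.307867] pc : cleanup_offline_cgwbs_workfn+0x320/0x394"
)

# ASSUMPTION: symbol start address, back-derived from the gdb output
# (0xffff8000102d6ddc - 0x320). In practice, look it up with nm.
symbol_base = 0xFFFF8000102D6ABC

print(sym, hex(off), "function size in bytes:", size)
print("address for addr2line:", hex(absolute_address(symbol_base, off)))
```

The resulting hex value is what would go in place of
<what address here?> in the addr2line command above. It is the same
address gdb already resolved, so addr2line may well point at the same
inlined jump_label.h line.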

>
> [ 4371.307867] pc : cleanup_offline_cgwbs_workfn+0x320/0x394
>
> Which means there's probably heavy inlining going on (do you use LTO by
> any chance?) because I don't think cleanup_offline_cgwbs_workfn() itself
> would compile into ~1k of code (but I don't have much experience with
> aarch64). Anyway, addr2line should tell us.

Actually I built the kernel on an internal build service, so I don't
know much about the build details, such as whether LTO was enabled.

>
> Also pasting oops into scripts/decodecode on aarch64 machine should tell
> us more about where and why the kernel crashed.

The output is:

# echo "Code: d63f0020 97f99963 17ffffa6 f8588263 (f9400061)" |
/usr/src/kernels/5.14.0-0.rc1.15.bx.el9.aarch64/scripts/decodecode
Code: d63f0020 97f99963 17ffffa6 f8588263 (f9400061)
All code
========
   0:   d63f0020        blr     x1
   4:   97f99963        bl      0xffffffffffe66590
   8:   17ffffa6        b       0xfffffffffffffea0
   c:   f8588263        ldur    x3, [x19, #-120]
  10:*  f9400061        ldr     x1, [x3]                <-- trapping instruction

Code starting with the faulting instruction
===========================================
   0:   f9400061        ldr     x1, [x3]
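
As a cross-check of the decodecode output, the trapping word can be
decoded by hand. The sketch below handles only one encoding, the 64-bit
"LDR Xt, [Xn, #imm]" unsigned-offset form, and the function name
decode_ldr_x_unsigned is made up for this example. If I read the
encoding right, f9400061 is a plain 8-byte load through x3, and x3
itself was loaded by the preceding ldur from [x19, #-120], so the fault
would be on dereferencing whatever pointer was stored there.

```python
def decode_ldr_x_unsigned(insn):
    """Decode AArch64 'LDR Xt, [Xn, #imm]' (64-bit, unsigned offset).

    Returns (rt, rn, byte_offset), or None if the word is some other
    instruction. Register number 31 (SP for the base register) is not
    given special treatment in this sketch.
    """
    # Bits 31..22 are fixed at 0b1111100101 for this encoding.
    if (insn & 0xFFC00000) != 0xF9400000:
        return None
    imm12 = (insn >> 10) & 0xFFF   # unsigned offset, in units of 8 bytes
    rn = (insn >> 5) & 0x1F        # base register
    rt = insn & 0x1F               # destination register
    return rt, rn, imm12 * 8


# The trapping instruction from the "Code:" line of the oops.
decoded = decode_ldr_x_unsigned(0xF9400061)
if decoded is not None:
    rt, rn, offset = decoded
    print(f"ldr x{rt}, [x{rn}, #{offset}]")

# The preceding word is an LDUR, a different encoding, so this
# single-purpose decoder reports None for it.
print(decode_ldr_x_unsigned(0xF8588263))
```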

>
>                                                                 Honza
>
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
>

Thanks,
Boyang



