Re: qemu:metag image runtime failure in -next due to 'kthread: allow to cancel kthread work'

James Hogan <james.hogan@xxxxxxxxxx> · Mon, 19 Sep 2016 22:37:41 +0100

On Mon, Sep 19, 2016 at 12:53:49PM -0700, Kees Cook wrote:
> On Mon, Sep 19, 2016 at 6:59 AM, James Hogan <james.hogan@xxxxxxxxxx> wrote:
> > On Fri, Sep 16, 2016 at 07:58:12PM -0700, Kees Cook wrote:
> >> On Fri, Sep 16, 2016 at 4:32 PM, James Hogan <james.hogan@xxxxxxxxxx> wrote:
> >> > On Fri, Sep 16, 2016 at 02:37:18PM -0700, Guenter Roeck wrote:
> >> >> On Fri, Sep 16, 2016 at 10:27:20PM +0100, James Hogan wrote:
> >> >> > On Fri, Sep 16, 2016 at 01:38:19PM -0700, Guenter Roeck wrote:
> >> >> > > Hi,
> >> >> > >
> >> >> > > I see the following runtime error in -next when running a metag qemu emulation.
> >> >> > >
> >> >> > > [ ... ]
> >> >> > > workingset: timestamp_bits=30 max_order=16 bucket_order=0
> >> >> > > io scheduler noop registered (default)
> >> >> > > brd: module loaded
> >> >> > > Warning: unable to open an initial console.
> >> >> > > List of all partitions:
> >> >> > > 0100           16384 ram0  (driver?)
> >> >> > > No filesystem could mount root, tried:
> >> >> > > Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(1,0)
> >> >> > >
> >> >> > > An example for a complete log is at:
> >> >> > > http://kerneltests.org/builders/qemu-metag-next/builds/489/steps/qemubuildcommand/logs/stdio
> >> >> > >
> >> >> > > bisect points to commit ef98de028afde ("kthread: allow to cancel kthread work").
> >> >> > > I don't know (yet) if other architectures are affected. bisect log is attached.
> >> >> > >
> >> >> > > The scripts to run this test are available at
> >> >> > > https://github.com/groeck/linux-build-test/tree/master/rootfs/metag.
> >> >> > >
> >> >> > > Guenter
> >> >> >
> >> >> > Thanks Guenter,
> >> >> >
> >> >> > It appears to be related to the command line. After that commit the
> >> >> > command line is shown as empty (rather than your "rdinit=/sbin/init
> >> >> > doreboot"), but it can still be overridden in the config and then it
> >> >> > continues to work.
> >> >> >
> >> >> Weird.
> >> >
> >> > Previously the Elf had a single load program header:
> >> >
> >> > Program Headers:
> >> >   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
> >> >   LOAD           0x004000 0x40000000 0x40000000 0x34b304 0x376230 RWE 0x4000
> >> >   NOTE           0x1ecc08 0x401e8c08 0x401e8c08 0x00000 0x00000 R   0x4
> >> >
> >> > QEMU puts the args at 40376230, straight after the load region (unlike a
> >> > real Meta Linux bootloader).
> >> >
> >> > After the above commit the ELF gets two load program headers, with a
> >> > small alignment gap in the middle:
> >> >
> >> > Program Headers:
> >> >   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
> >> >   LOAD           0x004000 0x40000000 0x40000000 0x18f118 0x18f118 R E 0x4000
> >> >   LOAD           0x194000 0x40190000 0x40190000 0x1bd304 0x1e8230 RWE 0x4000
> >> >   NOTE           0x1eec08 0x401eac08 0x401eac08 0x00000 0x00000 R   0x4
> >> >
> >> > Here this version of QEMU puts the args at where it thinks the end of
> >> > the loaded image is, which is based on the number of bytes copied from
> >> > the ELF, i.e. the total MemSiz's, not taking into account the alignment
> >> > gap in between, so it puts them at 0x40377348.
> >> >
> >> > But of course:
> >> > 40378230 B ___bss_stop
> >> >
> >> > so it wipes them out while clearing bss during early init.
> >> >
> >> > Previously:
> >> > 4018ebd0 T __sdata
> >> > 4018f000 R ___start_rodata
> >> >
> >> > now:
> >> > 4018ed98 T __sdata
> >> > 40190000 R ___start_rodata
> >> >
> >> > So I'm thinking this may have been triggered by c74ba8b3480d ("arch:
> >> > Introduce post-init read-only memory").
> >> >
> >> > The hack below does indeed reduce it to a single load section and this
> >> > version of QEMU then succeeds:
> >> >
> >> > Program Headers:
> >> >   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
> >> >   LOAD           0x004000 0x40000000 0x40000000 0x34d304 0x378230 RWE 0x4000
> >> >   NOTE           0x1eec88 0x401eac88 0x401eac88 0x00000 0x00000 R   0x4
> >> >
> >> > diff --git a/arch/metag/include/asm/cache.h b/arch/metag/include/asm/cache.h
> >> > index a43b650cfdc0..b5c7364a94da 100644
> >> > --- a/arch/metag/include/asm/cache.h
> >> > +++ b/arch/metag/include/asm/cache.h
> >> > @@ -20,4 +20,6 @@
> >> >
> >> >  #define __read_mostly __attribute__((__section__(".data..read_mostly")))
> >> >
> >> > +#define __ro_after_init        __read_mostly
> >> > +
> >> >  #endif
> >> >
> >> > Kees: Is it expected to get multiple load headers like this since your
> >> > patch c74ba8b3480d ("arch: Introduce post-init read-only memory"),
> >> > depending on alignment of the read only section?
> >>
> >> I'm not expecting that, and I'm especially not expecting any LOAD to
> >> have "RWE" flags. It looks like metag's vmlinux.lds.S is using the
> >> common RO_DATA_SECTION, so that doesn't jump out at me as a reason for
> >> RO_AFTER_INIT_DATA to end up in the wrong place. Also the second LOAD
> >> in the ELF is really huge, so it's not just the RO_AFTER_INIT_DATA
> >> section (which is currently very small).
> >>
> >> Could the metag linker be failing to override the section flags when
> >> putting RO_AFTER_INIT_DATA into RO_DATA_SECTION and this cuts the LOAD
> >> in half? This smells like a linker glitch.
> >
> > Where is that section flags override supposed to be happening?
> >
> > The .rodata section is indeed becoming slightly larger to fit the
> > ptmx_fops __ro_after_init, and in the process changing from flag A to
> > flag WA (presumably means becoming writable).
> >
> > This then appears to be hitting the condition in the linker whereby a
> > new segment is started if a writable section is found after a readonly
> > section, and the sections are on separate pages:
> >
> > https://sourceware.org/git/?p=binutils.git;a=blob;f=bfd/elf.c;h=8df38ee37923c020994715f111a0ffc6bae83c8c;hb=HEAD#l3952
> 
> Hm, well, I think x86, arm, and arm64 don't do this. They just clobber
> the flags from the section and force it to match the master section
> (i.e. ro_after_init gets rodata's flags).

Okay. Please forgive my ignorance, I'm just trying to understand the
mechanism, but is that thought to happen automatically due to
*(.data..ro_after_init) being in a section after *(.rodata) in the
linker script?

The definition in question is:
static struct file_operations ptmx_fops __ro_after_init;

Which isn't const or anything, so I'm not sure how else the linker is
supposed to know to make a section read-only that contains non-read-only
data.

Okay, I just built x86_64 default defconfig (on ef98de028afd, half way
through the mm patches on linux-next from the other day where metag
stopped booting). Perhaps I'm missing some important config option to
enable the memory protection (if so I appologise).

For metag:

$ readelf -S drivers/tty/pty.o
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [51] .data..ro_after_i PROGBITS        00000000 00f0c0 00007c 00  WA  0   0  4

$ readelf -S vmlinux.bust:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 4] .rodata           PROGBITS        40190000 194000 04c9c8 00  WA  0   0 64

And x86_64:

$ readelf -S drivers/tty/pty.o
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [18] .data..ro_after_i PROGBITS         0000000000000000  00001140
       00000000000000f8  0000000000000000  WA       0     0     64

$ readelf -S vmlinux
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 4] .rodata           PROGBITS         ffffffff81a00000  00c00000
       00000000002663d0  0000000000000000  WA       0     0     4096

Both have WA on that section in the object file and the final vmlinux
ELF too.

Thanks
James

> 
> >> However, since metag doesn't support CONFIG_DEBUG_RODATA, your above
> >> patch is likely correct. If metag marks memory read-only at kernel
> >> load time, it's doing it early enough that __ro_after_init will fail
> >> to work. If it never marks memory read-only, then __ro_after_init will
> >> have no effect.
> >
> > Yes, metag doesn't make that memory read only yet, although it would be
> > possible to do so (maybe delaying the switch to 4MB pages until
> > afterwards).
> >
> > Either way it doesn't sound that unreasonable to have multiple load
> > segments generated.
> 
> The bug here is that the ro_after_init portion of .rodata shouldn't be
> writable according to the ELF headers. It should be writable only
> because memory protection of .rodata hasn't happened yet. :)
> 
> >> (i.e. Both situations need the proposed patch).
> >
> > purely to avoid the curious linker segment behaviour (which only trips
> > up that QEMU because its broken), or for other reasons too?
> 
> Well, it means the linker behavior will go away since there will be no
> mixing of sections with different flags, and it'll match current
> expectations: __ro_after_init won't be read-only ever (same as
> .rodata).
> 
> Fixing memory permissions for metag would be nice, though! :)
> 
> -Kees
> 
> >
> > Thanks
> > James
> >
> >>
> >> Probably the best solution for all architectures is to invent a new
> >> CONFIG_ARCH_HAS... to identify the style of kernel memory protection a
> >> given architecture has so that the default for __ro_after_init can be
> >> more sensible (and the warnings about mark_rodata_ro() can better
> >> reflect reality).
> >>
> >> -Kees
> >>
> >> --
> >> Kees Cook
> >> Nexus Security
> 
> 
> 
> -- 
> Kees Cook
> Nexus Security
Attachment:
signature.asc

Description: Digital signature