+ mm-access-to-uninitialized-struct-page.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Mon, 30 Apr 2018 16:27:11 -0700

The patch titled
     Subject: mm: access to uninitialized struct page
has been added to the -mm tree.  Its filename is
     mm-access-to-uninitialized-struct-page.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-access-to-uninitialized-struct-page.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-access-to-uninitialized-struct-page.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
Subject: mm: access to uninitialized struct page

The following two bugs were reported by Fengguang Wu:

kernel reboot-without-warning in early-boot stage, last printk: early
console in setup code

http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@xxxxxxxxxxxxxxxxxxxxxx

And, also:
[per_cpu_ptr_to_phys] PANIC: early exception 0x0d
IP 10:ffffffffa892f15f error 0 cr2 0xffff88001fbff000

http://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@xxxxxxxxxxxxxxxxxxxxxx

Both problems are caused by accessing an uninitialized struct page from
trap_init().  We must first call mm_init() to initialize the allocated
struct pages, and only then can we access the fields of any struct page
that belongs to allocated memory.

Below is an explanation of the root cause.

The issue arises in this stack:

start_kernel()
 trap_init()
  setup_cpu_entry_areas()
   setup_cpu_entry_area(cpu)
    get_cpu_gdt_paddr(cpu)
     per_cpu_ptr_to_phys(addr)
      pcpu_addr_to_page(addr)
       virt_to_page(addr)
        pfn_to_page(__pa(addr) >> PAGE_SHIFT)

The returned "struct page" is sometimes uninitialized, and thus fails
later when used.  Whether it is initialized depends on the KASLR offset.

When boot fails, we have this when pfn_to_page() is called:
kaslr: 0x000000000d600000
 addr: ffffffff83e0d000
    pa: 1040d000
   pfn: 1040d
page: ffff88001f113340
page->flags ffffffffffffffff <- Uninitialized!

When boot is successful:
kaslr: 0x000000000a800000
 addr: ffffffff83e0d000
     pa: d60d000
    pfn: d60d
 page: ffff88001f05b340
page->flags 280000000000 <- Initialized!
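
The two dumps above differ in page->flags: all ones in the failing boot, a
plausible flags value in the working one.  A minimal user-space sketch of
that check follows.  It assumes that memory backing a struct page reads as
all ones until the page is initialized, as the failing dump suggests;
struct page_model and page_looks_uninitialized() are hypothetical names,
not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in; the real struct page has many more fields. */
struct page_model {
	uint64_t flags;
};

/* All bits set: the value seen in page->flags in the failing boot. */
#define FLAGS_ALL_ONES UINT64_MAX

/* Returns true if the flags word still holds the all-ones pattern. */
static bool page_looks_uninitialized(const struct page_model *page)
{
	assert(page != NULL);
	return page->flags == FLAGS_ALL_ONES;
}
```

This only models the symptom seen in the dumps; it says nothing about how
the kernel itself detects such pages.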

Here are physical addresses that BIOS provided to us:
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x000000001ffdffff] usable
BIOS-e820: [mem 0x000000001ffe0000-0x000000001fffffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved

In both cases, working and non-working, the real physical address is
the same:

pa - kaslr = 0x2E0D000

The only thing that is different is PFN.
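
The arithmetic above can be checked with a small user-space sketch.  It
assumes 4 KiB pages (PAGE_SHIFT of 12); pa_to_pfn() and pa_without_kaslr()
are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Same shift that pfn_to_page(__pa(addr) >> PAGE_SHIFT) applies. */
static uint64_t pa_to_pfn(uint64_t pa)
{
	return pa >> PAGE_SHIFT;
}

/* Physical address with the KASLR offset subtracted. */
static uint64_t pa_without_kaslr(uint64_t pa, uint64_t kaslr)
{
	assert(pa >= kaslr);
	return pa - kaslr;
}
```

With the values from the two dumps, pa_without_kaslr(0x1040d000,
0x0d600000) and pa_without_kaslr(0xd60d000, 0x0a800000) both give
0x2E0D000, while pa_to_pfn() gives 0x1040d and 0xd60d respectively.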

We initialize struct pages in four places:

1. Early in boot a small set of struct pages is initialized to fill
   the first section, and lower zones.

2. During mm_init() we initialize "struct pages" for all the memory
   that is allocated, i.e. reserved in memblock.

3. Using on-demand logic when pages are allocated after the mm_init()
   call.

4. After smp_init(), when the rest of the free deferred pages are
   initialized.

The above path happens before deferred memory is initialized, and thus it
must be covered by 1, 2 or 3.

So, let's check which PFNs are initialized after (1).

memmap_init_zone() is called for pfn ranges:
0x1 - 0x1000, and 0x1000 - 0x1ffe0, but it quits after reaching pfn
0x10000, as it leaves the rest to be initialized as deferred pages.

In the working scenario the pfn (0xd60d) ended up being below 0x10000,
but in the failing scenario the pfn (0x1040d) is above it.  Hence, we must
initialize this page in (2).  But trap_init() is called before mm_init().
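
The cut-off at pfn 0x10000 can be reproduced with a simplified user-space
model of step (1).  This is a sketch only: it assumes PAGES_PER_SECTION is
0x8000 (128 MiB sections with 4 KiB pages), that static_init_pgcnt equals
one section, and that lower zones are always fully initialized.  struct
node_model and the functions below are hypothetical names, loosely modelled
on the deferred-init check rather than copied from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_SECTION 0x8000UL /* assumption: 128 MiB / 4 KiB */

/* Hypothetical, reduced view of the per-node state used in step (1). */
struct node_model {
	unsigned long node_end_pfn;       /* one past the last pfn of the node */
	unsigned long static_init_pgcnt;  /* pages to initialize before deferring */
	unsigned long first_deferred_pfn; /* first pfn left for deferred init */
};

/* Returns false at the first pfn that is left for deferred initialization. */
static bool keep_initializing(struct node_model *node, unsigned long pfn,
			      unsigned long zone_end, unsigned long *nr_init)
{
	if (zone_end < node->node_end_pfn)
		return true; /* lower zone: always initialized in full */
	(*nr_init)++;
	if (*nr_init > node->static_init_pgcnt &&
	    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
		node->first_deferred_pfn = pfn;
		return false;
	}
	return true;
}

/* Walk one zone [start, end), stopping where deferral begins. */
static void init_zone_model(struct node_model *node, unsigned long start,
			    unsigned long end)
{
	unsigned long nr_init = 0;

	assert(start <= end);
	for (unsigned long pfn = start; pfn < end; pfn++)
		if (!keep_initializing(node, pfn, end, &nr_init))
			break;
}

/* Replay the two pfn ranges from the text and report the cut-off. */
static unsigned long first_deferred_pfn_model(void)
{
	struct node_model node = {
		.node_end_pfn = 0x1ffe0UL,
		.static_init_pgcnt = PAGES_PER_SECTION,
		.first_deferred_pfn = ~0UL,
	};

	init_zone_model(&node, 0x1UL, 0x1000UL);
	init_zone_model(&node, 0x1000UL, 0x1ffe0UL);
	return node.first_deferred_pfn;
}

/* True if the pfn is covered by step (1) in this model. */
static bool pfn_initialized_early(unsigned long pfn)
{
	return pfn < first_deferred_pfn_model();
}
```

In this model the second range stops at the first section boundary reached
after more than one section's worth of pages has been initialized, so pfn
0xd60d is covered by step (1) and pfn 0x1040d is not.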

The bug was introduced by "mm: initialize pages on demand during boot"
because we lowered the number of pages that are initialized in step (1).
But it could happen even without that change, because the number of
initialized pages was a guess.

The current fix moves trap_init() to be called after mm_init(), but as an
alternative, we could increase pgdat->static_init_pgcnt in
free_area_init_node():

       pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                        pgdat->node_spanned_pages);

Instead of one PAGES_PER_SECTION, we could set several, so that the kernel
text is covered for all KASLR offsets.  But this would still be guessing.
Therefore, I prefer the current fix.

Link: http://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@xxxxxxxxxx
Fixes: c9e97a1997fb ("mm: initialize pages on demand during boot")
Signed-off-by: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
Reviewed-by: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
Cc: Steven Sistare <steven.sistare@xxxxxxxxxx>
Cc: Daniel Jordan <daniel.m.jordan@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
Cc: Fengguang Wu <fengguang.wu@xxxxxxxxx>
Cc: Dennis Zhou <dennisszhou@xxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 init/main.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN init/main.c~mm-access-to-uninitialized-struct-page init/main.c

--- a/init/main.c~mm-access-to-uninitialized-struct-page
+++ a/init/main.c
@@ -585,8 +585,8 @@ asmlinkage __visible void __init start_k
 	setup_log_buf(0);
 	vfs_caches_init_early();
 	sort_main_extable();
-	trap_init();
 	mm_init();
+	trap_init();
 
 	ftrace_init();
 
_

Patches currently in -mm which might be from pasha.tatashin@xxxxxxxxxx are

mm-sections-are-not-offlined-during-memory-hotremove.patch
mm-access-to-uninitialized-struct-page.patch
sparc64-ng4-memset-32-bits-overflow.patch
