Re: mmotm boot panic bootmem-avoid-dma32-zone-by-default.patch

Hi,

On Thu, Mar 04, 2010 at 09:14:15PM -0800, Yinghai Lu wrote:
> On 03/04/2010 09:00 PM, Yinghai Lu wrote:
> > On 03/04/2010 07:21 PM, Johannes Weiner wrote:
> >> Hello Greg,
> >>
> >> On Thu, Mar 04, 2010 at 01:21:41PM -0800, Greg Thelen wrote:
> >>> On several systems I am seeing a boot panic if I use mmotm
> >>> (stamp-2010-03-02-18-38).  If I remove
> >>> bootmem-avoid-dma32-zone-by-default.patch then no panic is seen.  I
> >>> find that:
> >>> * 2.6.33 boots fine.
> >>> * 2.6.33 + mmotm w/o bootmem-avoid-dma32-zone-by-default.patch: boots fine.
> >>> * 2.6.33 + mmotm (including
> >>> bootmem-avoid-dma32-zone-by-default.patch): panics.
> >>> Note: I had to enable earlyprintk to see the panic.  Without
> >>> earlyprintk no console output was seen.  The system appeared to hang
> >>> after the loader.
> >>
> >> Thanks for your report.  A few notes below.
> >>
> >>> Here's the panic seen with earlyprintk using 2.6.33 + mmotm:
> >>>
> >>> Starting up ...
> >>> [    0.000000] Initializing cgroup subsys cpuset
> >>> [    0.000000] Initializing cgroup subsys cpu
> >>> [    0.000000] Linux version 2.6.33-mm1+
> >>> (gthelen@xxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.2.4 (Ubuntu
> >>> 4.2.4-1ubuntu4)) #1 SMP Thu Mar 4 12:03:29 PST 2010
> >>> [    0.000000] Command line:
> >>> root=UUID=a77f406a-7cc7-4f49-9cc2-818b2b4159ae ro console=tty0
> >>> console=ttyS0,115200n8 earlyprintk=serial,ttyS0,9600
> >>> [    0.000000] BIOS-provided physical RAM map:
> >>> [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> >>> [    0.000000]  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> >>> [    0.000000]  BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
> >>> [    0.000000]  BIOS-e820: 0000000000100000 - 000000000fff0000 (usable)
> >>> [    0.000000]  BIOS-e820: 000000000fff0000 - 0000000010000000 (ACPI data)
> >>> [    0.000000]  BIOS-e820: 00000000fffbd000 - 0000000100000000 (reserved)
> >>> [    0.000000] bootconsole [earlyser0] enabled
> >>> [    0.000000] NX (Execute Disable) protection: active
> >>> [    0.000000] DMI 2.4 present.
> >>> [    0.000000] No AGP bridge found
> >>> [    0.000000] last_pfn = 0xfff0 max_arch_pfn = 0x400000000
> >>> [    0.000000] PAT not supported by CPU.
> >>> [    0.000000] CPU MTRRs all blank - virtualized system.
> >>> [    0.000000] Scanning 1 areas for low memory corruption
> >>> [    0.000000] modified physical RAM map:
> >>> [    0.000000]  modified: 0000000000000000 - 0000000000010000 (reserved)
> >>> [    0.000000]  modified: 0000000000010000 - 000000000009fc00 (usable)
> >>> [    0.000000]  modified: 000000000009fc00 - 00000000000a0000 (reserved)
> >>> [    0.000000]  modified: 00000000000e8000 - 0000000000100000 (reserved)
> >>> [    0.000000]  modified: 0000000000100000 - 000000000fff0000 (usable)
> >>> [    0.000000]  modified: 000000000fff0000 - 0000000010000000 (ACPI data)
> >>> [    0.000000]  modified: 00000000fffbd000 - 0000000100000000 (reserved)
> >>> [    0.000000] init_memory_mapping: 0000000000000000-000000000fff0000
> >>
> >> 256MB of memory, right?
> >>
> >>> [    0.000000] RAMDISK: 0fd9d000 - 0ffdf539
> >>> [    0.000000] ACPI: RSDP 00000000000fb450 00014 (v00 QEMU  )
> >>> [    0.000000] ACPI: RSDT 000000000fff0000 00030 (v01 QEMU   QEMURSDT
> >>> 00000001 QEMU 00000001)
> >>> [    0.000000] ACPI: FACP 000000000fff0030 00074 (v01 QEMU   QEMUFACP
> >>> 00000001 QEMU 00000001)
> >>> [    0.000000] ACPI: DSDT 000000000fff0100 0089D (v01   BXPC   BXDSDT
> >>> 00000001 INTL 20061109)
> >>> [    0.000000] ACPI: FACS 000000000fff00c0 00040
> >>> [    0.000000] ACPI: APIC 000000000fff09d8 00068 (v01 QEMU   QEMUAPIC
> >>> 00000001 QEMU 00000001)
> >>> [    0.000000] ACPI: SSDT 000000000fff099d 00037 (v01 QEMU   QEMUSSDT
> >>> 00000001 QEMU 00000001)
> >>> [    0.000000] No NUMA configuration found
> >>> [    0.000000] Faking a node at 0000000000000000-000000000fff0000
> >>> [    0.000000] Initmem setup node 0 0000000000000000-000000000fff0000
> >>> [    0.000000]   NODE_DATA [0000000001c4e040 - 0000000001c5303f]
> >>> [    0.000000] BUG: unable to handle kernel NULL pointer dereference at (null)
> >>> [    0.000000] IP: [<ffffffff81b0f5f7>] memory_present+0x9a/0xbf
> >>> [    0.000000] PGD 0
> >>> [    0.000000] Oops: 0000 [#1] SMP
> >>> [    0.000000] last sysfs file:
> >>> [    0.000000] CPU 0
> >>> [    0.000000] Modules linked in:
> >>> [    0.000000]
> >>> [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.33-mm1+ #1 /
> >>> [    0.000000] RIP: 0010:[<ffffffff81b0f5f7>]  [<ffffffff81b0f5f7>]
> >>> memory_present+0x9a/0xbf
> >>> [    0.000000] RSP: 0000:ffffffff81a01e18  EFLAGS: 00010046
> >>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
> >>> [    0.000000] RDX: 0000000000000000 RSI: 0000000000000040 RDI: 0000000000000000
> >>> [    0.000000] RBP: ffffffff81a01e58 R08: ffffffffffffffff R09: 0000000000000040
> >>> [    0.000000] R10: ffff880001c4e040 R11: 0000000000004100 R12: 0000000000000000
> >>> [    0.000000] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> >>> [    0.000000] FS:  0000000000000000(0000) GS:ffffffff81adf000(0000)
> >>> knlGS:0000000000000000
> >>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [    0.000000] CR2: 0000000000000000 CR3: 0000000001a08000 CR4: 00000000000000b0
> >>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> [    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >>> [    0.000000] Process swapper (pid: 0, threadinfo ffffffff81a00000,
> >>> task ffffffff81a10020)
> >>> [    0.000000] Stack:
> >>> [    0.000000]  000000000fff0000 000000000000009f 0000000000000000
> >>> 0000000000000000
> >>> [    0.000000] <0> 0000000000000040 ffffffff81a01ef8 0000000000000000
> >>> 0000000000000000
> >>> [    0.000000] <0> ffffffff81a01e78 ffffffff81b0dd0e ffffffff81a01e88
> >>> 000000000fff0000
> >>> [    0.000000] Call Trace:
> >>> [    0.000000]  [<ffffffff81b0dd0e>]
> >>> sparse_memory_present_with_active_regions+0x31/0x47
> >>> [    0.000000]  [<ffffffff81b0688a>] paging_init+0x3f/0x5b
> >>> [    0.000000]  [<ffffffff81af81a7>] setup_arch+0x964/0xa03
> >>> [    0.000000]  [<ffffffff8103014a>] ? need_resched+0x1e/0x28
> >>> [    0.000000]  [<ffffffff8103015d>] ? should_resched+0x9/0x2a
> >>> [    0.000000]  [<ffffffff8152de24>] ? _cond_resched+0x9/0x1d
> >>> [    0.000000]  [<ffffffff81af4a34>] start_kernel+0x9f/0x382
> >>> [    0.000000]  [<ffffffff81af4299>] x86_64_start_reservations+0xa9/0xad
> >>> [    0.000000]  [<ffffffff81af4383>] x86_64_start_kernel+0xe6/0xed
> >>> [    0.000000] Code: c7 00 56 c2 81 e8 a0 f9 a1 ff 48 83 3c dd 00 16
> >>> c2 81 00 75 08 4c 89 2c dd 00 16 c2 81 fe 05 11 60 11 00 4c 89 ff e8
> >>> 85 3b 5c ff <48> 83 38 00 75 03 4c 89 30 49 81 c4 00 80 00 00 4c 3b 65
> >>> c8 72
> >>> [    0.000000] RIP  [<ffffffff81b0f5f7>] memory_present+0x9a/0xbf
> >>> [    0.000000]  RSP <ffffffff81a01e18>
> >>> [    0.000000] CR2: 0000000000000000
> >>> [    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
> >>> [    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
> >>> [    0.000000] Pid: 0, comm: swapper Tainted: G      D    2.6.33-mm1+ #1
> >>> [    0.000000] Call Trace:
> >>> [    0.000000]  [<ffffffff8103c78c>] panic+0x9e/0x113
> >>> [    0.000000]  [<ffffffff8103d3d6>] ? printk+0x67/0x69
> >>> [    0.000000]  [<ffffffff8105914e>] ? blocking_notifier_call_chain+0xf/0x11
> >>> [    0.000000]  [<ffffffff8103f8b4>] do_exit+0x78/0x70f
> >>> [    0.000000]  [<ffffffff8103ca2f>] ? spin_unlock_irqrestore+0x9/0xb
> >>> [    0.000000]  [<ffffffff8103dcde>] ? kmsg_dump+0x112/0x138
> >>> [    0.000000]  [<ffffffff81530061>] oops_end+0xb2/0xba
> >>> [    0.000000]  [<ffffffff810258d3>] no_context+0x1f5/0x204
> >>> [    0.000000]  [<ffffffff81025b1b>] __bad_area_nosemaphore+0x17f/0x1a2
> >>> [    0.000000]  [<ffffffff81025bb4>] bad_area_nosemaphore+0xe/0x10
> >>> [    0.000000]  [<ffffffff81531e36>] do_page_fault+0x122/0x24c
> >>> [    0.000000]  [<ffffffff8152f59f>] page_fault+0x1f/0x30
> >>> [    0.000000]  [<ffffffff81b0f5f7>] ? memory_present+0x9a/0xbf
> >>> [    0.000000]  [<ffffffff81b0f5f7>] ? memory_present+0x9a/0xbf
> >>> [    0.000000]  [<ffffffff81b0dd0e>]
> >>> sparse_memory_present_with_active_regions+0x31/0x47
> >>> [    0.000000]  [<ffffffff81b0688a>] paging_init+0x3f/0x5b
> >>> [    0.000000]  [<ffffffff81af81a7>] setup_arch+0x964/0xa03
> >>> [    0.000000]  [<ffffffff8103014a>] ? need_resched+0x1e/0x28
> >>> [    0.000000]  [<ffffffff8103015d>] ? should_resched+0x9/0x2a
> >>> [    0.000000]  [<ffffffff8152de24>] ? _cond_resched+0x9/0x1d
> >>> [    0.000000]  [<ffffffff81af4a34>] start_kernel+0x9f/0x382
> >>> [    0.000000]  [<ffffffff81af4299>] x86_64_start_reservations+0xa9/0xad
> >>> [    0.000000]  [<ffffffff81af4383>] x86_64_start_kernel+0xe6/0xed
> >>>
> >>> The kernel was built with 'make mrproper && make defconfig && make
> >>> ARCH=x86_64 CONFIG=smp -j 6'.  This panic is seen on every attempt, so
> >>> I can provide more diagnostics.
> >>
> >> Okay, if you did defconfig and just hit enter to all questions, you
> >> should have SPARSEMEM_EXTREME and NO_BOOTMEM enabled.  This means that
> >> the 'mem_section' is an array of pointers and the following happens in
> >> memory_present():
> >>
> >> 	for_one_pfn_in_each_section() {
> >> 		sparse_index_init(); /* no return value check */
> >> 		ms = __nr_to_section();
> >> 		if (!ms->section_mem_map) /* bang */
> >> 			...;
> >> 	}
> >>
> >> where sparse_index_init(), in the SPARSEMEM_EXTREME case, will allocate
> >> the mem_section descriptor with bootmem.  If this would fail, the box
> >> would panic immediately earlier, but NO_BOOTMEM does not seem to get it
> >> right.
> >>
> >> Greg, could you retry _with_ my bootmem patch applied, but with setting
> >> CONFIG_NO_BOOTMEM=n up front?
> >>
> >> I think NO_BOOTMEM has several problems.  Yinghai, can you verify them?
> >>
> >> 1. It does not seem to handle goal appropriately: bootmem would try
> >> without the goal if it does not make sense.  And in this case, the
> >> goal is 4G (above DMA32) and the amount of memory is 256M.
> >>
> >> And if I did not miss something, this is the difference with my patch:
> >> without it, the default goal is 16M, which is no problem as it is well
> >> within your available memory.  But the change of the default goal moved
> >> it outside it which the bootmem replacement can not handle.
> >>
> >> 2. The early reservation stuff seems to return NULL but callsites assume
> >> that the bootmem interface never does that.  Okay, the result is the same,
> >> we crash.  But it still moves error reporting to a possibly much later
> >> point where somebody actually dereferences the returned pointer.
> > 
> > related change could be: __alloc_bootmem_node_high...
> 
> no should be here...
> 
> static struct mem_section noinline __init_refok *sparse_index_alloc(int nid)
> {                                     
>         struct mem_section *section = NULL;
>         unsigned long array_size = SECTIONS_PER_ROOT *
>                                    sizeof(struct mem_section);
> 
>         if (slab_is_available()) {
>                 if (node_state(nid, N_HIGH_MEMORY))
>                         section = kmalloc_node(array_size, GFP_KERNEL, nid);
>                 else
>                         section = kmalloc(array_size, GFP_KERNEL);
>         } else
>                 section = alloc_bootmem_node(NODE_DATA(nid), array_size);
> 
> and
> 
> #define alloc_bootmem_node(pgdat, x) \
>         __alloc_bootmem_node(pgdat, x, SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
> 
> 
> then you change that goal MAX_DMA_ADDRESS to 4g..., but the system only have 256M 

and alloc_bootmem_core() will handle it.  The principle of the default goal is:
if you have memory outside the DMA zone, use that if possible.  If not, just use
what's there.

So increasing the default goal to above the DMA32 zone and falling back if not
possible is a sensible change in itself.
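
To illustrate the fallback described above, here is a minimal userspace
sketch, not the kernel's bootmem code.  The names sim_node, sim_alloc_core()
and sim_alloc() are hypothetical, and the bump pointer is a simplification
that stands in for the real bookkeeping.  It only shows the semantics: honour
the goal if memory exists at or above it, otherwise drop the goal and use
what is there.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one memory node: a physical range [start, end)
 * managed by a bump pointer.  This is an illustration only and does not
 * mirror the kernel's bootmem data structures. */
struct sim_node {
	uint64_t start;	/* first usable physical address */
	uint64_t end;	/* one past the last usable address */
	uint64_t next;	/* next free address */
};

#define SIM_ALLOC_FAIL UINT64_MAX

/* Try to place `size` bytes at or above `goal`; no fallback here. */
static uint64_t sim_alloc_core(struct sim_node *n, uint64_t size,
			       uint64_t goal)
{
	uint64_t base;

	assert(size > 0);
	base = n->next > goal ? n->next : goal;
	if (base < n->start)
		base = n->start;
	if (base >= n->end || size > n->end - base)
		return SIM_ALLOC_FAIL;
	n->next = base + size;
	return base;
}

/* Goal is a preference, not a requirement: retry without it on failure. */
static uint64_t sim_alloc(struct sim_node *n, uint64_t size, uint64_t goal)
{
	uint64_t addr = sim_alloc_core(n, size, goal);

	if (addr == SIM_ALLOC_FAIL && goal != 0)
		addr = sim_alloc_core(n, size, 0);
	return addr;
}
```

With a node covering 0x10000-0x0fff0000, as in the modified RAM map of the
report, a request with a 4G goal fails in sim_alloc_core() but succeeds
through sim_alloc(), which places it at the start of usable memory.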

Replacing the bootmem API implementation with something incompatible is NOT a
sensible change, however.  You have to do the fallback or review all callers
and make sure they conform to your new semantics.

My patch merely exposes this on common machines, those with <=4G of memory.
But you had already broken uncommon machines without my patch: those with
<=16M of memory.
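
The arithmetic behind that statement can be checked with a small sketch.
The helper goal_beyond_memory() is hypothetical, and the 8M machine is an
assumed example of a box with less than 16M.  The 256M figure uses the end
of usable memory from the report (0x0fff0000).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (UINT64_C(1) << 20)
#define GiB (UINT64_C(1) << 30)

/* True if an allocation goal lies at or past the end of usable memory,
 * i.e. the goal can only be satisfied by falling back to a lower address. */
static bool goal_beyond_memory(uint64_t mem_end, uint64_t goal)
{
	assert(mem_end > 0);
	return goal >= mem_end;
}
```

For the reported machine, goal_beyond_memory(0x0fff0000, 16 * MiB) is false
and goal_beyond_memory(0x0fff0000, 4 * GiB) is true.  For the assumed 8M
machine, goal_beyond_memory(8 * MiB, 16 * MiB) is already true with the old
default goal.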
