On 03/04/2010 09:00 PM, Yinghai Lu wrote:
> On 03/04/2010 07:21 PM, Johannes Weiner wrote:
>> Hello Greg,
>>
>> On Thu, Mar 04, 2010 at 01:21:41PM -0800, Greg Thelen wrote:
>>> On several systems I am seeing a boot panic if I use mmotm
>>> (stamp-2010-03-02-18-38). If I remove
>>> bootmem-avoid-dma32-zone-by-default.patch then no panic is seen. I
>>> find that:
>>> * 2.6.33 boots fine.
>>> * 2.6.33 + mmotm w/o bootmem-avoid-dma32-zone-by-default.patch: boots fine.
>>> * 2.6.33 + mmotm (including
>>> bootmem-avoid-dma32-zone-by-default.patch): panics.
>>> Note: I had to enable earlyprintk to see the panic. Without
>>> earlyprintk no console output was seen. The system appeared to hang
>>> after the loader.
>>
>> Thanks for your report. A few notes below.
>>
>>> Here's the panic seen with earlyprintk using 2.6.33 + mmotm:
>>>
>>> Starting up ...
>>> [ 0.000000] Initializing cgroup subsys cpuset
>>> [ 0.000000] Initializing cgroup subsys cpu
>>> [ 0.000000] Linux version 2.6.33-mm1+
>>> (gthelen@xxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.2.4 (Ubuntu
>>> 4.2.4-1ubuntu4)) #1 SMP Thu Mar 4 12:03:29 PST 2010
>>> [ 0.000000] Command line:
>>> root=UUID=a77f406a-7cc7-4f49-9cc2-818b2b4159ae ro console=tty0
>>> console=ttyS0,115200n8 earlyprintk=serial,ttyS0,9600
>>> [ 0.000000] BIOS-provided physical RAM map:
>>> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
>>> [ 0.000000] BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
>>> [ 0.000000] BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
>>> [ 0.000000] BIOS-e820: 0000000000100000 - 000000000fff0000 (usable)
>>> [ 0.000000] BIOS-e820: 000000000fff0000 - 0000000010000000 (ACPI data)
>>> [ 0.000000] BIOS-e820: 00000000fffbd000 - 0000000100000000 (reserved)
>>> [ 0.000000] bootconsole [earlyser0] enabled
>>> [ 0.000000] NX (Execute Disable) protection: active
>>> [ 0.000000] DMI 2.4 present.
>>> [ 0.000000] No AGP bridge found
>>> [ 0.000000] last_pfn = 0xfff0 max_arch_pfn = 0x400000000
>>> [ 0.000000] PAT not supported by CPU.
>>> [ 0.000000] CPU MTRRs all blank - virtualized system.
>>> [ 0.000000] Scanning 1 areas for low memory corruption
>>> [ 0.000000] modified physical RAM map:
>>> [ 0.000000] modified: 0000000000000000 - 0000000000010000 (reserved)
>>> [ 0.000000] modified: 0000000000010000 - 000000000009fc00 (usable)
>>> [ 0.000000] modified: 000000000009fc00 - 00000000000a0000 (reserved)
>>> [ 0.000000] modified: 00000000000e8000 - 0000000000100000 (reserved)
>>> [ 0.000000] modified: 0000000000100000 - 000000000fff0000 (usable)
>>> [ 0.000000] modified: 000000000fff0000 - 0000000010000000 (ACPI data)
>>> [ 0.000000] modified: 00000000fffbd000 - 0000000100000000 (reserved)
>>> [ 0.000000] init_memory_mapping: 0000000000000000-000000000fff0000
>>
>> 256MB of memory, right?
>>
>>> [ 0.000000] RAMDISK: 0fd9d000 - 0ffdf539
>>> [ 0.000000] ACPI: RSDP 00000000000fb450 00014 (v00 QEMU )
>>> [ 0.000000] ACPI: RSDT 000000000fff0000 00030 (v01 QEMU QEMURSDT
>>> 00000001 QEMU 00000001)
>>> [ 0.000000] ACPI: FACP 000000000fff0030 00074 (v01 QEMU QEMUFACP
>>> 00000001 QEMU 00000001)
>>> [ 0.000000] ACPI: DSDT 000000000fff0100 0089D (v01 BXPC BXDSDT
>>> 00000001 INTL 20061109)
>>> [ 0.000000] ACPI: FACS 000000000fff00c0 00040
>>> [ 0.000000] ACPI: APIC 000000000fff09d8 00068 (v01 QEMU QEMUAPIC
>>> 00000001 QEMU 00000001)
>>> [ 0.000000] ACPI: SSDT 000000000fff099d 00037 (v01 QEMU QEMUSSDT
>>> 00000001 QEMU 00000001)
>>> [ 0.000000] No NUMA configuration found
>>> [ 0.000000] Faking a node at 0000000000000000-000000000fff0000
>>> [ 0.000000] Initmem setup node 0 0000000000000000-000000000fff0000
>>> [ 0.000000] NODE_DATA [0000000001c4e040 - 0000000001c5303f]
>>> [ 0.000000] BUG: unable to handle kernel NULL pointer dereference at (null)
>>> [ 0.000000] IP: [<ffffffff81b0f5f7>] memory_present+0x9a/0xbf
>>> [ 0.000000] PGD 0
>>> [ 0.000000] Oops: 0000 [#1] SMP
>>> [ 0.000000] last sysfs file:
>>> [ 0.000000] CPU 0
>>> [ 0.000000] Modules linked in:
>>> [ 0.000000]
>>> [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.33-mm1+ #1 /
>>> [ 0.000000] RIP: 0010:[<ffffffff81b0f5f7>] [<ffffffff81b0f5f7>]
>>> memory_present+0x9a/0xbf
>>> [ 0.000000] RSP: 0000:ffffffff81a01e18 EFLAGS: 00010046
>>> [ 0.000000] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
>>> [ 0.000000] RDX: 0000000000000000 RSI: 0000000000000040 RDI: 0000000000000000
>>> [ 0.000000] RBP: ffffffff81a01e58 R08: ffffffffffffffff R09: 0000000000000040
>>> [ 0.000000] R10: ffff880001c4e040 R11: 0000000000004100 R12: 0000000000000000
>>> [ 0.000000] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
>>> [ 0.000000] FS: 0000000000000000(0000) GS:ffffffff81adf000(0000)
>>> knlGS:0000000000000000
>>> [ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 0.000000] CR2: 0000000000000000 CR3: 0000000001a08000 CR4: 00000000000000b0
>>> [ 0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> [ 0.000000] Process swapper (pid: 0, threadinfo ffffffff81a00000,
>>> task ffffffff81a10020)
>>> [ 0.000000] Stack:
>>> [ 0.000000] 000000000fff0000 000000000000009f 0000000000000000
>>> 0000000000000000
>>> [ 0.000000] <0> 0000000000000040 ffffffff81a01ef8 0000000000000000
>>> 0000000000000000
>>> [ 0.000000] <0> ffffffff81a01e78 ffffffff81b0dd0e ffffffff81a01e88
>>> 000000000fff0000
>>> [ 0.000000] Call Trace:
>>> [ 0.000000] [<ffffffff81b0dd0e>]
>>> sparse_memory_present_with_active_regions+0x31/0x47
>>> [ 0.000000] [<ffffffff81b0688a>] paging_init+0x3f/0x5b
>>> [ 0.000000] [<ffffffff81af81a7>] setup_arch+0x964/0xa03
>>> [ 0.000000] [<ffffffff8103014a>] ? need_resched+0x1e/0x28
>>> [ 0.000000] [<ffffffff8103015d>] ? should_resched+0x9/0x2a
>>> [ 0.000000] [<ffffffff8152de24>] ? _cond_resched+0x9/0x1d
>>> [ 0.000000] [<ffffffff81af4a34>] start_kernel+0x9f/0x382
>>> [ 0.000000] [<ffffffff81af4299>] x86_64_start_reservations+0xa9/0xad
>>> [ 0.000000] [<ffffffff81af4383>] x86_64_start_kernel+0xe6/0xed
>>> [ 0.000000] Code: c7 00 56 c2 81 e8 a0 f9 a1 ff 48 83 3c dd 00 16
>>> c2 81 00 75 08 4c 89 2c dd 00 16 c2 81 fe 05 11 60 11 00 4c 89 ff e8
>>> 85 3b 5c ff <48> 83 38 00 75 03 4c 89 30 49 81 c4 00 80 00 00 4c 3b 65
>>> c8 72
>>> [ 0.000000] RIP [<ffffffff81b0f5f7>] memory_present+0x9a/0xbf
>>> [ 0.000000] RSP <ffffffff81a01e18>
>>> [ 0.000000] CR2: 0000000000000000
>>> [ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
>>> [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
>>> [ 0.000000] Pid: 0, comm: swapper Tainted: G D 2.6.33-mm1+ #1
>>> [ 0.000000] Call Trace:
>>> [ 0.000000] [<ffffffff8103c78c>] panic+0x9e/0x113
>>> [ 0.000000] [<ffffffff8103d3d6>] ? printk+0x67/0x69
>>> [ 0.000000] [<ffffffff8105914e>] ? blocking_notifier_call_chain+0xf/0x11
>>> [ 0.000000] [<ffffffff8103f8b4>] do_exit+0x78/0x70f
>>> [ 0.000000] [<ffffffff8103ca2f>] ? spin_unlock_irqrestore+0x9/0xb
>>> [ 0.000000] [<ffffffff8103dcde>] ? kmsg_dump+0x112/0x138
>>> [ 0.000000] [<ffffffff81530061>] oops_end+0xb2/0xba
>>> [ 0.000000] [<ffffffff810258d3>] no_context+0x1f5/0x204
>>> [ 0.000000] [<ffffffff81025b1b>] __bad_area_nosemaphore+0x17f/0x1a2
>>> [ 0.000000] [<ffffffff81025bb4>] bad_area_nosemaphore+0xe/0x10
>>> [ 0.000000] [<ffffffff81531e36>] do_page_fault+0x122/0x24c
>>> [ 0.000000] [<ffffffff8152f59f>] page_fault+0x1f/0x30
>>> [ 0.000000] [<ffffffff81b0f5f7>] ? memory_present+0x9a/0xbf
>>> [ 0.000000] [<ffffffff81b0f5f7>] ? memory_present+0x9a/0xbf
>>> [ 0.000000] [<ffffffff81b0dd0e>]
>>> sparse_memory_present_with_active_regions+0x31/0x47
>>> [ 0.000000] [<ffffffff81b0688a>] paging_init+0x3f/0x5b
>>> [ 0.000000] [<ffffffff81af81a7>] setup_arch+0x964/0xa03
>>> [ 0.000000] [<ffffffff8103014a>] ? need_resched+0x1e/0x28
>>> [ 0.000000] [<ffffffff8103015d>] ? should_resched+0x9/0x2a
>>> [ 0.000000] [<ffffffff8152de24>] ? _cond_resched+0x9/0x1d
>>> [ 0.000000] [<ffffffff81af4a34>] start_kernel+0x9f/0x382
>>> [ 0.000000] [<ffffffff81af4299>] x86_64_start_reservations+0xa9/0xad
>>> [ 0.000000] [<ffffffff81af4383>] x86_64_start_kernel+0xe6/0xed
>>>
>>> The kernel was built with 'make mrproper && make defconfig && make
>>> ARCH=x86_64 CONFIG=smp -j 6'. This panic is seen on every attempt, so
>>> I can provide more diagnostics.
>>
>> Okay, if you did defconfig and just hit enter to all questions, you
>> should have SPARSEMEM_EXTREME and NO_BOOTMEM enabled. This means that
>> the 'mem_section' is an array of pointers and the following happens in
>> memory_present():
>>
>>     for_one_pfn_in_each_section() {
>>         sparse_index_init(); /* no return value check */
>>         ms = __nr_to_section();
>>         if (!ms->section_mem_map) /* bang */
>>             ...;
>>     }
>>
>> where sparse_index_init(), in the SPARSEMEM_EXTREME case, will allocate
>> the mem_section descriptor with bootmem. If this would fail, the box
>> would panic immediately earlier, but NO_BOOTMEM does not seem to get it
>> right.
>>
>> Greg, could you retry _with_ my bootmem patch applied, but with setting
>> CONFIG_NO_BOOTMEM=n up front?
>>
>> I think NO_BOOTMEM has several problems. Yinghai, can you verify them?
>>
>> 1. It does not seem to handle goal appropriately: bootmem would try
>> without the goal if it does not make sense. And in this case, the
>> goal is 4G (above DMA32) and the amount of memory is 256M.
>>
>> And if I did not miss something, this is the difference with my patch:
>> without it, the default goal is 16M, which is no problem as it is well
>> within your available memory. But the change of the default goal moved
>> it outside it which the bootmem replacement can not handle.
>>
>> 2. The early reservation stuff seems to return NULL but callsites assume
>> that the bootmem interface never does that.
>> Okay, the result is the same,
>> we crash. But it still moves error reporting to a possibly much later
>> point where somebody actually dereferences the returned pointer.
>
> related change could be: __alloc_bootmem_node_high...

No, the relevant code should be here:

static struct mem_section noinline __init_refok *sparse_index_alloc(int nid)
{
        struct mem_section *section = NULL;
        unsigned long array_size = SECTIONS_PER_ROOT *
                                   sizeof(struct mem_section);

        if (slab_is_available()) {
                if (node_state(nid, N_HIGH_MEMORY))
                        section = kmalloc_node(array_size, GFP_KERNEL, nid);
                else
                        section = kmalloc(array_size, GFP_KERNEL);
        } else
                section = alloc_bootmem_node(NODE_DATA(nid), array_size);

and

#define alloc_bootmem_node(pgdat, x) \
        __alloc_bootmem_node(pgdat, x, SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))

Then you changed that goal from MAX_DMA_ADDRESS to 4G, but the system
only has 256M.

YH
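
For illustration only, the goal handling discussed in this thread can be
modelled in user space. The sketch below is not kernel code: struct toy_bank
and the toy_alloc_* helpers are hypothetical names made up for this example,
and the model assumes a single contiguous bank of free memory with a simple
bump pointer. It contrasts an allocator that gives up when the goal lies
beyond the end of memory with one that retries without the goal, which is
the behaviour Johannes describes for the classic bootmem allocator.

```c
#include <assert.h>

/* Hypothetical model: one contiguous bank of free memory [start, end). */
struct toy_bank {
	unsigned long long start;     /* first usable byte (must be > 0) */
	unsigned long long end;       /* one past the last usable byte */
	unsigned long long next_free; /* bump pointer, start <= next_free <= end */
};

/*
 * Allocate 'size' bytes at or above 'goal'.
 * Returns the allocated address, or 0 if the request does not fit.
 */
static unsigned long long toy_alloc_at(struct toy_bank *b,
				       unsigned long long size,
				       unsigned long long goal)
{
	unsigned long long addr = b->next_free > goal ? b->next_free : goal;

	assert(size > 0);
	if (addr < b->start)
		addr = b->start;
	if (addr >= b->end || size > b->end - addr)
		return 0;
	b->next_free = addr + size;
	return addr;
}

/* Strict variant: the goal is never relaxed, so an unreachable goal fails. */
static unsigned long long toy_alloc_strict(struct toy_bank *b,
					   unsigned long long size,
					   unsigned long long goal)
{
	return toy_alloc_at(b, size, goal);
}

/* Fallback variant: if the goal cannot be met, retry without a goal. */
static unsigned long long toy_alloc_fallback(struct toy_bank *b,
					     unsigned long long size,
					     unsigned long long goal)
{
	unsigned long long addr = toy_alloc_at(b, size, goal);

	if (!addr && goal)
		addr = toy_alloc_at(b, size, 0);
	return addr;
}
```

With a bank covering 0x10000 to 0x0fff0000 (roughly the 256M from the report)
and a goal of 4G, toy_alloc_strict() returns 0 while toy_alloc_fallback()
still hands out memory from the start of the bank. A caller that does not
check for 0 in the strict case would fault later, much like the unchecked
sparse_index_init() call in memory_present().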