On Tue, 2019-03-05 at 14:42 +0000, Mel Gorman wrote:
> On Mon, Mar 04, 2019 at 10:55:04PM -0500, Qian Cai wrote:
> > Reverting the patches below from linux-next seems to fix a crash while
> > running LTP oom01.
> >
> > 915c005358c1 mm, compaction: Capture a page under direct compaction -fix
> > e492a5711b67 mm, compaction: capture a page under direct compaction
> >
> > In particular, removing just this chunk alone seems to fix the problem.
> >
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -2227,10 +2227,10 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> >                 }
> >
> >                 /* Stop if a page has been captured */
> > -               if (capc && capc->page) {
> > -                       ret = COMPACT_SUCCESS;
> > -                       break;
> > -               }
> >
>
> It's hard to make sense of how this is connected to the bug. The
> out-of-bounds warning would have required page flags to be corrupted
> quite badly or maybe the use of an uninitialised page. How reproducible
> has this been for you? I just ran the test 100 times with UBSAN and page
> alloc debugging enabled and it completed correctly.
>

I can reproduce this every time within 3 runs of oom01 on this x86_64 server,
but so far I have been unable to reproduce it on arm64 or ppc64le servers.

# for i in `seq 1 3`; do /opt/ltp/testcases/bin/oom01 ; done

Sometimes it triggers different traces.

[ 391.704320] SLUB: Unable to allocate memory on node -1, gfp=0x800(GFP_NOWAIT)
[ 391.737794] cache: kmalloc-64, object size: 64, buffer size: 416, default order: 2, min order: 0
[ 391.778079] node 0: slabs: 5999, objs: 232851, free: 16
[ 391.802926] node 1: slabs: 4303, objs: 167067, free: 37
[ 499.866479] ------------[ cut here ]------------
[ 499.866500] BUG: Bad page state in process oom01 pfn:fffffe7a09fffd07
[ 499.890013] kernel BUG at mm/page_alloc.c:3124!
[ 499.935430] double fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 499.971334] CPU: 0 PID: 1623 Comm: oom01 Tainted: G W 5.0.0-next-20190305+ #49
[ 499.992805] ================================================================================
[ 500.009887] Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
[ 500.009901] RIP: 0010:check_memory_region+0x10/0x1e0
[ 500.048252] UBSAN: Undefined behaviour in kernel/locking/qspinlock.c:138:9
[ 500.085378] Code: 00 00 00 48 89 e5 e8 ff 3e 9f 00 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 85 f6 0f 84 68 01 00 00 55 0f b6 d2 48 89 e5 <41> 55 41 54 53 e9 b3 00 00 00 48 b8 00 00 00 00 00 00 00 ff 48 39
[ 500.107608] index 8190 is out of range for type 'long unsigned int [256]'
[ 500.138462] RSP: 0000:ffff888428f80000 EFLAGS: 00010002
[ 500.223186] CPU: 42 PID: 0 Comm: swapper/42 Tainted: G W 5.0.0-next-20190305+ #49
[ 500.253922] RAX: ffff88827fff41c0 RBX: ffff88827fff41c8 RCX: ffffffff9c0a9468
[ 500.253925] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88827fff41f8
[ 500.277367] Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
[ 500.277370] Call Trace:
[ 500.318081] RBP: ffff888428f80000 R08: ffffed104fffe840 R09: ffffed104fffe83f
[ 500.318085] R10: ffffed104fffe83f R11: ffff88827fff41fb R12: ffff88827fff41f8
[ 500.349838] <IRQ>
[ 500.381765] R13: ffff88827fff41c8 R14: ffff88842a96f770 R15: ffff88827fff41c8
[ 500.381768] FS: 00007fdfd3559700(0000) GS:ffff8881f3c00000(0000) knlGS:0000000000000000
[ 500.424074] dump_stack+0x62/0x9a
[ 500.435452] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 500.435455] CR2: ffff888428f7fff8 CR3: 000000041abca003 CR4: 00000000001606b0
[ 500.467546] ubsan_epilogue+0xd/0x7f
[ 500.500039] Call Trace:
[ 500.500042] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables xfs sd_mod ahci igb libahci i2c_algo_bit i2c_core libata dm_mirror dm_region_hash dm_log dm_mod efivarfs
[ 500.509058] __ubsan_handle_out_of_bounds+0x14d/0x192
[ 500.541152] ---[ end trace f9ff2b89b6b88c5f ]---
[ 500.541155] invalid opcode: 0000 [#2] SMP DEBUG_PAGEALLOC KASAN PTI
[ 500.541159] CPU: 10 PID: 262 Comm: kcompactd0 Tainted: G D W 5.0.0-next-20190305+ #49
[ 500.541161] Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
[ 500.541167] RIP: 0010:__isolate_free_page+0x464/0x600
[ 500.541170] Code: 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c6 20 6f 0b 9d 48 89 df e8 4a 8b f8 ff 0f 0b 48 c7 c7 a0 32 69 9d e8 51 40 43 00 <0f> 0b 48 c7 c7 e0 31 69 9d e8 43 40 43 00 48 c7 c6 80 71 0b 9d 48
[ 500.541172] RSP: 0000:ffff8881f1fdf848 EFLAGS: 00010002
[ 500.541175] RAX: 00000000f0000080 RBX: ffffea00064fc000 RCX: ffff88827fff41d0
[ 500.541177] RDX: 1ffffd4000c9f806 RSI: 0000000000000008 RDI: ffffffff9d9f1640
[ 500.541179] RBP: ffff8881f1fdf898 R08: ffffea00064fc000 R09: ffff8881f1fdfd30
[ 500.541181] R10: 0000000000000002 R11: 1ffff1104fffe83b R12: 0000000000000008
[ 500.541183] R13: dffffc0000000000 R14: ffff88827fff3000 R15: 0000000000000002
[ 500.541185] FS: 0000000000000000(0000) GS:ffff8881f4100000(0000) knlGS:0000000000000000
[ 500.541188] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 500.541190] CR2: 00007fdce416a000 CR3: 000000026ea16002 CR4: 00000000001606a0
[ 500.541191] Call Trace:
[ 500.541199] compaction_alloc+0x886/0x25f0
[ 500.541221] unmap_and_move+0x37/0x1e70
[ 500.541228] migrate_pages+0x2ca/0xb20
[ 500.541238] compact_zone+0x19cb/0x3620
[ 500.541252] kcompactd_do_work+0x2df/0x680