On Thu, Nov 12, 2015 at 09:36:14AM +0900, Minchan Kim wrote: <snip> > > > mmotm-2015-10-15-15-20-no-madvise_free, IOW it means git head for > > > 54bad5da4834 arm64: add pmd_[dirty|mkclean] for THP so there is no > > > MADV_FREE code in there > > > + pte_mkdirty patch > > > + freeze/unfreeze patch > > > + do_page_add_anon_rmap patch > > > + above split_huge_pmd > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS > > > BUG: Bad rss-counter state mm:ffff88007fa3bb80 idx:1 val:512 > > > > With the patch below my test setup run for 2+ days without triggering the > > bug. split_huge_pmd patch should be dropped. > > > > Please test. > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index 14cbbad54a3e..7aa0a3fef2aa 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -2841,9 +2841,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > > write = pmd_write(*pmd); > > young = pmd_young(*pmd); > > > > - /* leave pmd empty until pte is filled */ > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > - > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > > pmd_populate(mm, &_pmd, pgtable); > > > > @@ -2893,6 +2890,28 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > > } > > > > smp_wmb(); /* make pte visible before pmd */ > > + /* > > + * Up to this point the pmd is present and huge and userland has the > > + * whole access to the hugepage during the split (which happens in > > + * place). If we overwrite the pmd with the not-huge version pointing > > + * to the pte here (which of course we could if all CPUs were bug > > + * free), userland could trigger a small page size TLB miss on the > > + * small sized TLB while the hugepage TLB entry is still established in > > + * the huge TLB. Some CPU doesn't like that. > > + * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum > > + * 383 on page 93. Intel should be safe but is also warns that it's > > + * only safe if the permission and cache attributes of the two entries > > + * loaded in the two TLB is identical (which should be the case here). > > + * But it is generally safer to never allow small and huge TLB entries > > + * for the same virtual address to be loaded simultaneously. So instead > > + * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the > > + * current pmd notpresent (atomically because here the pmd_trans_huge > > + * and pmd_trans_splitting must remain set at all times on the pmd > > + * until the split is complete for this pmd), then we flush the SMP TLB > > + * and finally we write the non-huge version of the pmd entry with > > + * pmd_populate. > > + */ > > + pmdp_invalidate(vma, haddr, pmd); > > pmd_populate(mm, pmd, pgtable); > > > > if (freeze) { > > I have been tested this patch with MADV_DONTNEED for a few days and > I couldn't see the problem any more. And I will continue to test it > with MADV_FREE. During the test with MADV_FREE on kernel I applied your patches, I couldn't see any problem. However, in this round, I did another test which is same one I attached but a liitle bit different because it doesn't do (memcg things/kill/swapoff) for testing program long-live test. With that, I encountered this problem. page:ffffea0000f60080 count:1 mapcount:0 mapping:ffff88007f584691 index:0x600002a02 flags: 0x400000000006a028(uptodate|lru|writeback|swapcache|reclaim|swapbacked) page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffff880077cf0c00 ------------[ cut here ]------------ kernel BUG at mm/huge_memory.c:3340! invalid opcode: 0000 [#1] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 7 PID: 1657 Comm: memhog Not tainted 4.3.0-rc5-mm1-madv-free+ #4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006b0f1a40 ti: ffff88004ced4000 task.ti: ffff88004ced4000 RIP: 0010:[<ffffffff8114bf67>] [<ffffffff8114bf67>] split_huge_page_to_list+0x907/0x920 RSP: 0018:ffff88004ced7a38 EFLAGS: 00010296 RAX: 0000000000000021 RBX: ffffea0000f60080 RCX: ffffffff81830db8 RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821df4d8 RBP: ffff88004ced7ab8 R08: 0000000000000000 R09: ffff8800000bc560 R10: ffffffff8163d880 R11: 0000000000014f25 R12: ffffea0000f60080 R13: ffffea0000f60088 R14: ffffea0000f60080 R15: 0000000000000000 FS: 00007f43d3ced740(0000) GS:ffff8800782e0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff1f6fcdb98 CR3: 000000004cf56000 CR4: 00000000000006a0 Stack: cccccccccccccccd ffffea0000f60080 ffff88004ced7ad0 ffffea0000f60088 ffff88004ced7ad0 0000000000000000 ffff88004ced7ab8 ffffffff810ef9d0 ffffea0000f60000 0000000000000000 0000000000000000 ffffea0000f60080 Call Trace: [<ffffffff810ef9d0>] ? __lock_page+0xa0/0xb0 [<ffffffff8114c09c>] deferred_split_scan+0x11c/0x260 [<ffffffff81117bfc>] ? list_lru_count_one+0x1c/0x30 [<ffffffff81101333>] shrink_slab.part.42+0x1e3/0x350 [<ffffffff81105daa>] shrink_zone+0x26a/0x280 [<ffffffff81105eed>] do_try_to_free_pages+0x12d/0x3b0 [<ffffffff81106224>] try_to_free_pages+0xb4/0x140 [<ffffffff810f8a59>] __alloc_pages_nodemask+0x459/0x920 [<ffffffff8111e667>] handle_mm_fault+0xc77/0x1000 [<ffffffff8142718d>] ? retint_kernel+0x10/0x10 [<ffffffff81033629>] __do_page_fault+0x189/0x400 [<ffffffff810338ac>] do_page_fault+0xc/0x10 [<ffffffff81428142>] page_fault+0x22/0x30 Code: ff ff 48 c7 c6 f0 b2 77 81 4c 89 f7 e8 13 c3 fc ff 0f 0b 48 83 e8 01 e9 88 f7 ff ff 48 c7 c6 70 a1 77 81 4c 89 f7 e8 f9 c2 fc ff <0f> 0b 48 c7 c6 38 af 77 81 4c 89 e7 e8 e8 c2 fc ff 0f 0b 66 0f RIP [<ffffffff8114bf67>] split_huge_page_to_list+0x907/0x920 RSP <ffff88004ced7a38> ---[ end trace c9a60522e3a296e4 ]--- So, I reverted all MADV_FREE patches and chaged it with MADV_DONTNEED. In this time, I saw below oops in this time. If I miss somethings, please let me know it. ------------[ cut here ]------------ kernel BUG at include/linux/swapops.h:129! invalid opcode: 0000 [#1] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 5 PID: 1563 Comm: madvise_test Not tainted 4.3.0-rc5-mm1-no-madv-free+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88007e8d3480 ti: ffff88007f524000 task.ti: ffff88007f524000 RIP: 0010:[<ffffffff811504be>] [<ffffffff811504be>] migration_entry_to_page.part.61+0x4/0x6 RSP: 0018:ffff88007f527cd0 EFLAGS: 00010246 RAX: ffffea0000896b00 RBX: 00006000013ac000 RCX: ffffea0000000000 RDX: 0000000000000000 RSI: ffffea0001f93e80 RDI: 3e000000000225ac RBP: ffff88007f527cd0 R08: 0000000000000101 R09: ffff88007e4fa000 R10: ffffea0001fda740 R11: 0000000000000000 R12: 00000000044b583e R13: 00006000013ad000 R14: ffff88007f527e00 R15: ffff88007e4fad60 FS: 00007fe2f099a740(0000) GS:ffff8800782a0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000166c0d0 CR3: 000000007e57b000 CR4: 00000000000006a0 Stack: ffff88007f527db8 ffffffff81118030 00006000017fffff ffff88007f527e00 00006000017fffff ffff88007ed71000 ffff88007e57b600 0000600001800000 0000600001800000 00006000017fffff 0000600001800000 ffff88007efb6b78 Call Trace: [<ffffffff81118030>] unmap_single_vma+0x840/0x880 [<ffffffff811188a1>] unmap_vmas+0x41/0x60 [<ffffffff8111dfad>] unmap_region+0x9d/0x100 [<ffffffff81120007>] do_munmap+0x217/0x380 [<ffffffff811201b1>] vm_munmap+0x41/0x60 [<ffffffff811210d2>] SyS_munmap+0x22/0x30 [<ffffffff81420357>] entry_SYSCALL_64_fastpath+0x12/0x6a Code: df 48 c1 ff 06 49 01 fc 4c 89 e7 e8 9c ff ff ff 85 c0 74 0c 4c 89 e0 48 c1 e0 06 48 29 d8 eb 02 31 c0 5b 41 5c 5d c3 55 48 89 e5 <0f> 0b 55 48 c7 c6 30 80 77 81 48 89 e5 e8 f0 45 fc ff 0f 0b 55 RIP [<ffffffff811504be>] migration_entry_to_page.part.61+0x4/0x6 RSP <ffff88007f527cd0> ---[ end trace 01097fb7f9cf1b6c ]--- Another hit: page:ffffea0000520080 count:2 mapcount:0 mapping:ffff880072b38a51 index:0x600002602 flags: 0x4000000000048028(uptodate|lru|swapcache|swapbacked) page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffff880077cf0c00 ------------[ cut here ]------------ kernel BUG at mm/huge_memory.c:3306! invalid opcode: 0000 [#1] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 6 PID: 1419 Comm: madvise_test Not tainted 4.3.0-rc5-mm1-no-madv-free+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006f108000 ti: ffff88006f054000 task.ti: ffff88006f054000 RIP: 0010:[<ffffffff811473bf>] [<ffffffff811473bf>] split_huge_page_to_list+0x81f/0x890 RSP: 0000:ffff88006f057a40 EFLAGS: 00010282 RAX: 0000000000000021 RBX: ffffea0000520080 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821dd418 RBP: ffff88006f057ab8 R08: 0000000000000000 R09: ffff8800000bfb20 R10: ffffffff8163d1c0 R11: 0000000000005c5f R12: ffff88006f057ad0 R13: ffffea0000520080 R14: ffffea0000520080 R15: 0000000000000000 FS: 00007f09963a2740(0000) GS:ffff8800782c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000600003d92000 CR3: 000000007372e000 CR4: 00000000000006a0 Stack: ffffea0000520080 ffff88006f057ad0 ffffea0000520088 ffff88006f057ad0 0000000000000000 ffff88006f057ab8 ffffffff810ec700 ffffea0000520000 0000000000000000 0000000000000000 ffffea0000520080 ffff88006f057ad0 Call Trace: [<ffffffff810ec700>] ? __lock_page+0xa0/0xb0 [<ffffffff81147545>] deferred_split_scan+0x115/0x240 [<ffffffff8111445c>] ? list_lru_count_one+0x1c/0x30 [<ffffffff810fdd63>] shrink_slab.part.43+0x1e3/0x350 [<ffffffff81102788>] shrink_zone+0x238/0x250 [<ffffffff811028cd>] do_try_to_free_pages+0x12d/0x3b0 [<ffffffff81102c04>] try_to_free_pages+0xb4/0x140 [<ffffffff810f57b9>] __alloc_pages_nodemask+0x459/0x920 [<ffffffff8111aa2a>] handle_mm_fault+0xbca/0xf90 [<ffffffff8105b8bc>] ? enqueue_task+0x3c/0x60 [<ffffffff810602eb>] ? __set_cpus_allowed_ptr+0x9b/0x1a0 [<ffffffff81032b49>] __do_page_fault+0x189/0x400 [<ffffffff81032dcc>] do_page_fault+0xc/0x10 [<ffffffff81421e02>] page_fault+0x22/0x30 Code: ff ff 48 c7 c6 d0 91 77 81 4c 89 f7 e8 1b d7 fc ff 0f 0b 48 83 e8 01 e9 70 f8 ff ff 48 c7 c6 50 80 77 81 4c 89 f7 e8 01 d7 fc ff <0f> 0b 48 c7 c6 d8 be 77 81 4c 89 ef e8 f0 d6 fc ff 0f 0b 48 83 RIP [<ffffffff811473bf>] split_huge_page_to_list+0x81f/0x890 RSP <ffff88006f057a40> ---[ end trace 0ce8751b8410cd8e ]--- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>