On 09/11/2014 07:39 AM, Hugh Dickins wrote: > On Wed, 10 Sep 2014, Sasha Levin wrote: >> On 09/10/2014 03:36 PM, Hugh Dickins wrote: >>> Right, and Sasha reports that that can fire, but he sees the bug >>> with this patch in and without that firing. >> >> I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful >> VMA information out, and got the following: > > Well, thanks, but Mel and I have both failed to perceive any actual > problem arising from that peculiarity. And Mel's warning, and the 900s > in yesterday's dumps, have shown that it is not correlated with the > pte_mknuma() bug we are chasing. So there isn't anything that I want to > look up in these vmas. Or did you notice something interesting in them? I thought this was a separate issue that would need taking care of as well. >> And on a maybe related note, I've started seeing the following today. It may >> be because we fixed mbind() in trinity but it could also be related to > > The fixed trinity may be counter-productive for now, since we think > there is an understandable pte_mknuma() bug coming from that direction, > but have not posted a patch for it yet. I'm still seeing the bug with fixed trinity, it was a matter of adding more flags to mbind. >> this issue (free_pgtables() is in the call chain). If you don't think it has >> anything to do with it let me know and I'll start a new thread: >> >> [ 1195.996803] BUG: unable to handle kernel NULL pointer dereference at (null) >> [ 1196.001744] IP: __rb_erase_color (include/linux/rbtree_augmented.h:107 lib/rbtree.c:229 lib/rbtree.c:367) >> [ 1196.001744] Call Trace: >> [ 1196.001744] vma_interval_tree_remove (mm/interval_tree.c:24) >> [ 1196.001744] __remove_shared_vm_struct (mm/mmap.c:232) >> [ 1196.001744] unlink_file_vma (mm/mmap.c:246) >> [ 1196.001744] free_pgtables (mm/memory.c:547) >> [ 1196.001744] exit_mmap (mm/mmap.c:2826) >> [ 1196.001744] mmput (kernel/fork.c:654) >> [ 1196.001744] do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:461 kernel/exit.c:746) > > I didn't study in any detail, but this one seems much more like the > zeroing and vma corruption that you've been seeing in other dumps. > > Though a single pte_mknuma() crash could presumably be caused by vma > corruption (but I think not mere zeroing), the recurrent way in which > you hit that pte_mknuma() bug in particular makes it unlikely to be > caused by random corruption. > > You are generating new crashes faster than we can keep up with them. > Would this be a suitable point for you to switch over to testing > 3.17-rc, to see if that is as unstable for you as -next is? > > That VM_BUG_ON(!(val & _PAGE_PRESENT)) is not in the 3.17-rc tree, > but I think you can "safely" add it to 3.17-rc. Quotes around > "safely" meaning that we know that there's a bug to hit, at least > in -next, but I don't think it's going to be hit for stupid obvious > reasons. I'll try it, usually I just hit a bunch of issues that were already fixed in -next, which is why I try sticking to one tree. > And you're using a gcc 5 these days? That's another variable to > try removing from the mix, to see if it makes a difference. I'm seeing the BUG getting hit with 4.7.2, so I don't think it's compiler dependant. I'll try reproducing everything I reported yesterday with 4.7.2 just in case, but I don't think that this is the issue. Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>