On Fri, 13 Jul 2018 16:34:49 -0700 Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, Jul 13, 2018 at 4:13 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > It does seem to be related to low-memory situation. Maybe page-out.
> > I'm wondering if it's one of the fairly scary MM patches from this
> > merge window
>
> Woo-hoo! Yes, I got it to happen in text-mode.
>
>    kernel BUG at mm/page_alloc.c:2016
>
> with the call chain being
>
>   RIP: move_freepages_block()
>   Call Trace:
>     steal_suitable_fallback
>     get_page_from_freelist
>     __alloc_pages_nodemask
>     new_slab
>     ___slab_alloc
>     __slab_alloc
>     kmem_cache_alloc
>     __d_alloc
>     d_alloc
>     ...
>
> (and then it goes down to sys_openat and path lookup).
>
> I actually used the dcache stress-tester and a stupid "allocate memory
> and keep dirtying it" to get low on memory, and that d_alloc because
> of that.
>
> And then when VM_BUG_ON() causes a do_exit(), you get a nested
> exception due to "sleeping function called from invalid context" in
> exit_signals. And then the machine is well and truly dead and f*cked.
>
> I hate BUG_ON() calls. I wonder how many weeks ago it was that I
> complained about people adding BUG_ON() calls last?
>
> Anyway, looks like core VM buggery. Now, I don't know *which* one of
> the multiple tests in that VM_BUG_ON() triggered,

They all did:

	VM_BUG_ON(pfn_valid(page_to_pfn(start_page)) &&
	          pfn_valid(page_to_pfn(end_page)) &&
	          page_zone(start_page) != page_zone(end_page));

> and I have no idea
> which commit caused it, but at least non-VM people can probably
> breathe a sigh of relief.
>
> Andrew, I suspect it's some of yours. Adding Willy, because some of
> the scariest ones in the VM layer are from him (like all those page
> member movement ones).
>

Cc's added.  Pavel has been fiddling with this code lately.

The comment is interesting.

	/*
	 * page_zone is not safe to call in this context when
	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
	 * anyway as we check zone boundaries in move_freepages_block().
	 * Remove at a later date when no bug reports exist related to
	 * grouping pages by mobility
	 */

but we should work out why we're suddenly getting a range which crosses
zones before we just zap it.  (But it would be interesting to see
whether removing the check "fixes" it)
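
To spell out the "they all did" point: the VM_BUG_ON() condition is a single
&& chain, so it can only fire when both pfns are valid and the two pages
report different zones. Below is a minimal userspace sketch of that logic,
not kernel code. The struct fake_page type, its fields, and both function
names are hypothetical stand-ins for struct page, pfn_valid(page_to_pfn())
and page_zone().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; not the kernel's type. */
struct fake_page {
	bool pfn_is_valid;	/* models pfn_valid(page_to_pfn(page)) */
	int zone_id;		/* models page_zone(page) */
};

/*
 * Models the VM_BUG_ON() condition quoted above: true only when both
 * ends of the range have valid pfns AND they sit in different zones.
 */
static bool range_trips_check(const struct fake_page *start,
			      const struct fake_page *end)
{
	return start->pfn_is_valid &&
	       end->pfn_is_valid &&
	       start->zone_id != end->zone_id;
}

/* Small self-check covering the three interesting cases. */
void demo_self_check(void)
{
	struct fake_page start      = { true, 0 };
	struct fake_page same_zone  = { true, 0 };
	struct fake_page other_zone = { true, 1 };
	struct fake_page hole       = { false, 1 };

	/* Range stays inside one zone: the check stays quiet. */
	assert(!range_trips_check(&start, &same_zone));
	/* End pfn is invalid (a hole): && short-circuits, no zone compare. */
	assert(!range_trips_check(&start, &hole));
	/* Both pfns valid, zones differ: this is the case that would BUG. */
	assert(range_trips_check(&start, &other_zone));
}
```

So a report of this BUG already tells us the range had two valid ends in
different zones. What it does not tell us is how such a range was built in
the first place.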