On Fri, Aug 02, 2019 at 01:23:06PM -0700, Andrew Morton wrote:
> > [259701.387365] BUG: Bad page state in process Xorg pfn:2a300
> > [259701.393593] page:ffffea0000a8c000 refcount:0 mapcount:-128
> > mapping:0000000000000000 index:0x0

mapcount -128 is PAGE_MAPCOUNT_RESERVE, aka PageBuddy. I think somebody
called put_page() once more than they should have. The one before this
caused it to be freed to the page allocator, which set PageBuddy. Then
this one happened and we got a complaint.

> > [259701.402832] flags: 0x2000000000000000()
> > [259701.407426] raw: 2000000000000000 ffffffff822ab778 ffffea0000a8f208
> > 0000000000000000
> > [259701.415900] raw: 0000000000000000 0000000000000003 00000000ffffff7f
> > 0000000000000000
> > [259701.424373] page dumped because: nonzero mapcount

It occurs to me that when a page is freed, we could record some useful bits
of information in the page from the stack trace to help debug double-free
situations. Even just stashing __builtin_return_address in page->mapping
would be helpful, I think.

> > [259701.549382] Call Trace:
> > [259701.549382] dump_stack+0x46/0x60
> > [259701.549382] bad_page.cold.28+0x81/0xb4
> > [259701.549382] __free_pages_ok+0x236/0x240
> > [259701.549382] __ttm_dma_free_page+0x2f/0x40
> > [259701.549382] ttm_dma_unpopulate+0x29b/0x370
> > [259701.549382] ttm_tt_destroy.part.6+0x44/0x50
> > [259701.549382] ttm_bo_cleanup_memtype_use+0x29/0x70
> > [259701.549382] ttm_bo_put+0x225/0x280
> > [259701.549382] ttm_bo_vm_close+0x10/0x20
> > [259701.549382] remove_vma+0x20/0x40
> > [259701.549382] __do_munmap+0x2da/0x420
> > [259701.549382] __vm_munmap+0x66/0xc0
> > [259701.549382] __x64_sys_munmap+0x22/0x30
> > [259701.549382] do_syscall_64+0x5e/0x1a0
> > [259701.549382] ? prepare_exit_to_usermode+0x75/0xa0
> > [259701.549382] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [259701.549382] RIP: 0033:0x7f504d0ec1d7
> > [259701.549382] Code: 10 e9 67 ff ff ff 0f 1f 44 00 00 48 8b 15 b1 6c 0c 00 f7
> > d8 64 89 02 48 c7 c0 ff ff ff ff e9 6b ff ff ff b8 0b 00 00 00 0f 05 <48> 3d 01
> > f0 ff ff 73 01 c3 48 8b 0d 89 6c 0c 00 f7 d8 64 89 01 48
> > [259701.549382] RSP: 002b:00007ffe529db138 EFLAGS: 00000206 ORIG_RAX:
> > 000000000000000b
> > [259701.549382] RAX: ffffffffffffffda RBX: 0000564a5eabce70 RCX:
> > 00007f504d0ec1d7
> > [259701.549382] RDX: 00007ffe529db140 RSI: 0000000000400000 RDI:
> > 00007f5044b65000
> > [259701.549382] RBP: 0000564a5eafe460 R08: 000000000000000b R09:
> > 000000010283e000
> > [259701.549382] R10: 0000000000000001 R11: 0000000000000206 R12:
> > 0000564a5e475b08
> > [259701.549382] R13: 0000564a5e475c80 R14: 00007ffe529db190 R15:
> > 0000000000000c80
> > [259701.707238] Disabling lock debugging due to kernel taint
>
> I assume the above is misbehaviour in the DRM code?
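
To make the "stash __builtin_return_address in page->mapping" idea a bit
more concrete, here is a rough userspace sketch. It is not kernel code:
struct mock_page, its fields and mock_free_page() are all made up for
illustration, and the real struct page layout and free path look nothing
like this. It only shows the shape of the idea: on the first free, remember
who freed the page; on a second free, report both callers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for struct page; not the kernel layout. */
struct mock_page {
	bool buddy;	/* models PageBuddy: page is owned by the allocator */
	void *mapping;	/* unused while free, so reused to hold the freeing caller */
};

/*
 * Free a page and remember who freed it.
 * Returns true on a normal free, false if the page was already free.
 * noinline keeps __builtin_return_address(0) pointing at our caller.
 */
__attribute__((noinline))
bool mock_free_page(struct mock_page *page)
{
	void *caller = __builtin_return_address(0);

	assert(page != NULL);

	if (page->buddy) {
		/* Double free: we can now name the earlier culprit as well. */
		fprintf(stderr,
			"bad page state: first freed by %p, freed again by %p\n",
			page->mapping, caller);
		return false;
	}

	page->buddy = true;
	page->mapping = caller;	/* stash the first freer for later debugging */
	return true;
}
```

Calling mock_free_page() twice on the same zero-initialised mock_page
returns true the first time and false the second time, and the second call
prints both return addresses. In a real kernel those addresses could be
printed with %pS to get symbol names, assuming the field chosen for the
stash really is unused while the page sits in the allocator.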