On Sun, Mar 10, 2024 at 04:31:25PM +0000, Ryan Roberts wrote:
> That's exactly how I discovered the original problem, and was hoping
> that with your fix, this would unblock me. Given I can only repro this
> when my changes are on top, I guess my code is most likely buggy,
> but perhaps you can take a quick look at the oops and tell me what
> you think?

Well, now my code isn't implicated, I have no interest in helping you.

Just kidding ;-)

> [ 96.372503] BUG: Bad page state in process usemem pfn:be502
> [ 96.373336] page: refcount:0 mapcount:0 mapping:000000005abfa8d5 index:0x0 pfn:0xbe502
> [ 96.374341] aops:0x0 ino:fffffc0001f940c8
> [ 96.374893] flags: 0x7fff8000000000(node=0|zone=0|lastcpupid=0xffff)
> [ 96.375653] page_type: 0xffffffff()
> [ 96.376071] raw: 007fff8000000000 0000000000000000 fffffc0001f94090 ffff0000c99ee860
> [ 96.377055] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> [ 96.378650] page dumped because: non-NULL mapping

OK, so page->mapping is ffff0000c99ee860 which does look plausible.
At least it's not a deferred_list (although it is a pfn suitable for
having a deferred_list ... for any allocation up to order-9)

> [ 96.390688] dump_stack_lvl+0x78/0xc8
> [ 96.391163] dump_stack+0x18/0x28
> [ 96.391545] bad_page+0x88/0x128
> [ 96.391893] get_page_from_freelist+0xa94/0x1bc0
> [ 96.392407] __alloc_pages+0x194/0x10b0
>
> [ 113.131515] ------------[ cut here ]------------
> [ 113.132190] UBSAN: array-index-out-of-bounds in mm/vmscan.c:1654:14
> [ 113.132892] index 7 is out of range for type 'long unsigned int [5]'
> [ 113.133617] CPU: 9 PID: 528 Comm: kswapd0 Tainted: G B 6.8.0-rc5-ryarob01-swap-out-v4 #2
> [ 113.134705] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [ 113.135500] Call trace:
> [ 113.135776] dump_backtrace+0x9c/0x128
> [ 113.136218] show_stack+0x20/0x38
> [ 113.136574] dump_stack_lvl+0x78/0xc8
> [ 113.136964] dump_stack+0x18/0x28
> [ 113.137322] __ubsan_handle_out_of_bounds+0xa0/0xd8
> [ 113.137885] isolate_lru_folios+0x57c/0x658

I wish it weren't UBSAN reporting this, then we could get the folio
dumped. I suppose we could put in an explicit check for folio_zonenum()
being >= 5. Does it usually happen in isolate_lru_folios()?

> nr_skipped is a stack array of 5 elements. So I guess
> folio_zonenum(folio) is returning 7. That comes from the flags. I guess
> this is most likely just a side effect of the corrupted folio due to
> someone writing to it while it's on the free list?

Or it's a pointer to something that's not a folio? Are we taking the
wrong lock somewhere again?
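
To make the "explicit check" idea concrete, here is a rough userspace sketch of what such a bounds check might look like. It is not the code in mm/vmscan.c. Every sketch_* name is hypothetical, and the zone field's position (bit 55) and width (3 bits) are assumptions chosen only to be consistent with the flags value and the "index 7" in the report above; the real layout depends on the kernel config.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for illustration only; the real values depend on kernel config. */
#define SKETCH_MAX_NR_ZONES 5   /* matches 'long unsigned int [5]' in the UBSAN report */
#define SKETCH_ZONES_WIDTH  3   /* 3 bits can hold 0..7, so 5..7 are representable but invalid */
#define SKETCH_ZONES_SHIFT  55  /* hypothetical bit position of the zone field */

static_assert((1u << SKETCH_ZONES_WIDTH) >= SKETCH_MAX_NR_ZONES,
              "zone field must be wide enough for every valid zone");

/* Userspace stand-in for folio_zonenum(): pull the raw zone number out of a flags word. */
static unsigned int sketch_zonenum(uint64_t flags)
{
    return (unsigned int)((flags >> SKETCH_ZONES_SHIFT) &
                          ((UINT64_C(1) << SKETCH_ZONES_WIDTH) - 1));
}

/*
 * Account one skipped folio against its zone.
 * Returns false and leaves nr_skipped untouched when the zone number is
 * out of range, so the caller gets a chance to dump the offending folio
 * instead of indexing past the end of the array.
 */
static bool sketch_account_skipped(uint64_t flags,
                                   unsigned long nr_skipped[SKETCH_MAX_NR_ZONES])
{
    unsigned int zone = sketch_zonenum(flags);

    if (zone >= SKETCH_MAX_NR_ZONES)
        return false;

    nr_skipped[zone]++;
    return true;
}
```

In the kernel the false branch would presumably be where the folio gets dumped, which is the information UBSAN is not giving us here. Note the comparison has to be >= the array size, since an index of 5 is already out of bounds for a 5-element array.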