Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Fri, 8 Mar 2024 18:18:17 +0000

On Fri, Mar 08, 2024 at 06:09:25PM +0000, Ryan Roberts wrote:
> I think the world is trying to tell me "it's Friday night. Stop". I can no longer
> reproduce the non-NULL mapping oops that I was able to hit reliably this morning.

HEISENBUG!

> I do have this one though:
> 
> [  197.332914] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> [  197.340790] pc : deferred_split_scan+0x210/0x260
> [  197.341154] lr : deferred_split_scan+0x70/0x260
> [  197.347534] Call trace:
> [  197.347729]  deferred_split_scan+0x210/0x260
> [  197.348069]  do_shrink_slab+0x184/0x750
> 
> 
> deferred_split_scan+0x210/0x260 is the code that I added back:
> 
> if (!folio_try_get(folio)) {
> 	/* We lost race with folio_put() */
> 	list_del_init(&folio->_deferred_list); <<<< HERE
> 	ds_queue->split_queue_len--;
> 	continue;
> }
> 
> We have the spinlock here so that really should not be happening. So does that
> mean the list is being manipulated outside of the lock somewhere? Or maybe it's
> mapping (actually one of the deferred_list pointers) being cleared by the buddy?
> I dunno... give up. Will resume on Monday. Have a good weekend.

This is actually congruent with a new theory I have which is that
somewhere/somehow we're freeing the page without taking it off the
deferred list.  I don't see such a path, but if it does exist, we could
absolutely corrupt the deferred_list in this way.  Just working on a
patch to make my detection patch reliable ...
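
To illustrate the failure mode (not the actual detection patch), here is a
minimal userspace sketch.  Everything in it is an assumption made for
illustration: struct fake_folio, fake_free_without_unlink() and
list_entry_is_sane() are hypothetical names, and the list helpers are
simplified stand-ins for the kernel's, not the kernel code itself.  It
models a folio whose memory is cleared while its entry is still queued,
leaving the queue pointing at an entry with NULL next/prev, which is what
a later list_del_init() would trip over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *last = head->prev;

	entry->next = head;
	entry->prev = last;
	last->next = entry;
	head->prev = entry;
}

/* Unlink an entry; dereferences entry->prev and entry->next. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical check: do both neighbours still point back at us? */
static bool list_entry_is_sane(const struct list_head *e)
{
	return e->next && e->prev &&
	       e->next->prev == e && e->prev->next == e;
}

/* Hypothetical model of a folio with a deferred-split list entry. */
struct fake_folio {
	int refcount;
	struct list_head deferred_list;
};

/* Assumption: models the memory being freed and cleared while queued. */
static void fake_free_without_unlink(struct fake_folio *f)
{
	memset(f, 0, sizeof(*f));
}

/* Correct path: unlink first, so the queue stays consistent. */
bool demo_proper_unlink(void)
{
	struct list_head queue;
	struct fake_folio a = {0}, b = {0};

	INIT_LIST_HEAD(&queue);
	list_add_tail(&a.deferred_list, &queue);
	list_add_tail(&b.deferred_list, &queue);

	list_del_init(&b.deferred_list);
	return queue.prev == &a.deferred_list &&
	       b.deferred_list.next == &b.deferred_list;
}

/* Suspected path: entry cleared while still reachable from the queue. */
bool demo_freed_while_queued(void)
{
	struct list_head queue;
	struct fake_folio a = {0}, b = {0};
	bool sane_before, sane_after;

	INIT_LIST_HEAD(&queue);
	list_add_tail(&a.deferred_list, &queue);
	list_add_tail(&b.deferred_list, &queue);
	sane_before = list_entry_is_sane(&b.deferred_list);

	fake_free_without_unlink(&b);

	/* queue.prev still points at b's entry, whose pointers are now NULL;
	 * calling list_del_init() on it here would dereference NULL. */
	sane_after = list_entry_is_sane(queue.prev);
	return sane_before && !sane_after;
}
```

Both demo functions return true in this model: the first shows the queue
staying consistent after a proper unlink, the second shows the stale entry
being caught by the sanity check before anything dereferences it.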

You have a good weekend too!