Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

On 29 Apr 2024, at 20:31, Luis Chamberlain wrote:

> On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
>> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
>>
>>> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
>>>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
>>>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
>>>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
>>>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
>>>>>>>> From: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
>>>>>>>>
>>>>>>>> using that API for LBS results in a NULL pointer dereference
>>>>>>>> in the writeback path [1].
>>>>>>>>
>>>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
>>>>>>>
>>>>>>>  How would I go about reproducing this?
>>>>
>>>> Well, the change below fixes this, but I am not sure it is correct.
>>>> folio_mark_dirty() at least documents that a folio should not be
>>>> truncated while it is running, so I am not sure we should be splitting
>>>> folios here even though we do check for writeback once.
>>>> truncate_inode_partial_folio() calls folio_wait_writeback(), but it
>>>> calls split_folio() before it checks folio_test_dirty() to decide
>>>> whether the truncate failed. Since the folio is locked, it is not
>>>> clear why this should be possible.
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 83955362d41c..90195506211a 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>  	if (new_order >= folio_order(folio))
>>>>  		return -EINVAL;
>>>>
>>>> -	if (folio_test_writeback(folio))
>>>> +	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>>>>  		return -EBUSY;
>>>>
>>>>  	if (!folio_test_anon(folio)) {
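>>>>
>>>> For reference, the ordering in question, the tail end of
>>>> truncate_inode_partial_folio() in mm/truncate.c, looks roughly like
>>>> this (a trimmed sketch, not a verbatim quote):
>>>>
>>>> 	/* Wait for any writeback in flight before touching the folio. */
>>>> 	folio_wait_writeback(folio);
>>>> 	...
>>>> 	if (!folio_test_large(folio))
>>>> 		return true;
>>>> 	/* Split first... */
>>>> 	if (split_folio(folio) == 0)
>>>> 		return true;
>>>> 	/* ...and only check dirty after the split attempt failed. */
>>>> 	if (folio_test_dirty(folio))
>>>> 		return false;
>>>> 	truncate_inode_folio(folio->mapping, folio);
>>>> 	return true;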
>>>
>>> I wondered which code path was causing this and triggering the NULL
>>> pointer dereference, so I just sprinkled a check here:
>>>
>>> 	VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
>>>
>>> The answer was:
>>>
>>> kcompactd() --> migrate_pages_batch()
>>>                   --> try_split_folio() --> split_folio_to_list() -->
>>>                       split_huge_page_to_list_to_order()
>>>
>>
>> There are 3 try_split_folio() calls in migrate_pages_batch().
>
> This is only true for linux-next; in v6.9-rc5, on which this testing
> is based, there are only two.
>
>> The first one splits anonymous large folios that are on the deferred
>> split list, so it is not related;
>
> This is in linux-next and not v6.9-rc5.
>
>> the second one splits THPs when THP migration is not supported, but
>> this is compaction, so it is not related; the third one splits large
>> folios when there is no same-size free page in the system, and that
>> should be the one.
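>>
>> That third call site looks roughly like this (a paraphrased sketch of
>> mm/migrate.c, not a verbatim quote):
>>
>> 	case -ENOMEM:
>> 		/*
>> 		 * No free page of the same order was available for the
>> 		 * migration target: split the large folio onto
>> 		 * split_folios and let the caller retry migration of
>> 		 * the smaller folios.
>> 		 */
>> 		if (folio_test_large(folio))
>> 			try_split_folio(folio, split_folios);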
>
> Agreed, it is the case where migrate_folio_unmap() failed with -ENOMEM.
> This also helps us enhance the reproducer further, which I'll do next.
>
>>> And I verified that moving the check only to the migrate_pages_batch()
>>> path also fixes the crash:
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 73a052a382f1..83b528eb7100 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>>>  	int rc;
>>>
>>>  	folio_lock(folio);
>>> +	if (folio_test_dirty(folio)) {
>>> +		rc = -EBUSY;
>>> +		goto out;
>>> +	}
>>>  	rc = split_folio_to_list(folio, split_folios);
>>> +out:
>>>  	folio_unlock(folio);
>>>  	if (!rc)
>>>  		list_move_tail(&folio->lru, split_folios);
>>>
>>> However, I'd like the compaction folks to review this. I see some
>>> indications in the code that migration can race with truncation, and
>>> that we consider this safe because we take the folio lock. Here,
>>> however, we have a case where the folio is clearly locked and yet
>>> dirty. Other migration code seems to write the folio back and wait for
>>> that to complete; here we just move on. Further reading of commit
>>> 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping as completely
>>> unmovable") hints that migration is safe if the mapping either does
>>> not exist, or does exist but has mapping->a_ops->migrate_folio, so I'd
>>> like further feedback on this.
>>
>> During migration, all page table entries pointing to this dirty folio
>> are invalid, and accesses to the folio will cause a page fault and
>> wait on the migration entry. I am not sure we need to skip dirty folios.
>
> I see.. thanks!
>
>>> Another thing that requires review: if we split a folio not down to
>>> order-0 but to the new minimum order, does the accounting in
>>> migrate_pages_batch() require changing? And most puzzling, why do we
>>
>> What accounting are you referring to? The split code should take care of it.
>
> The folio order can change after a split, so I was concerned about the
> nr_pages used in migrate_pages_batch(). But I see now that when
> migrate_folio_unmap() first fails we try to split the folio, and if
> that succeeds the caller calls migrate_pages_batch() again, with a
> single retry attempt, on just the split folios. I also see that
> nr_pages is local to each list in each loop: first on the from list
> for unmapping, and afterwards on the unmap list when we move the folios.
>
>>> not see this with regular large folios, but we do see it with minorder?
>>
>> I wonder if the split code handles folio->mapping->i_pages properly.
>> Does i_pages store just folio pointers, or does it also need all the
>> tail page pointers? I am no expert in fs, so I need help here.
>
> mapping->i_pages stores either folio pointers or swap/dax/shadow value
> entries (xa_is_value(folio)) in the page cache. The folios, however,
> can be special, and we special-case those with shmem_mapping(mapping)
> checks. split_huge_page_to_list_to_order() doesn't get called with
> swap/dax/shadow entries, and we already bail out on
> shmem_mapping(mapping).
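>
> As a sketch of what that means (assuming the standard XArray iteration
> pattern used by helpers like find_get_entries(); this is not a quote of
> the split code itself), a walk over mapping->i_pages distinguishes the
> two kinds of entries like this:
>
> 	XA_STATE(xas, &mapping->i_pages, 0);
> 	struct folio *folio;
>
> 	rcu_read_lock();
> 	xas_for_each(&xas, folio, ULONG_MAX) {
> 		if (xas_retry(&xas, folio))
> 			continue;
> 		if (xa_is_value(folio)) {
> 			/* A shadow, swap or DAX value entry, not a folio. */
> 			continue;
> 		}
> 		/*
> 		 * A real folio. A large folio is stored as a single
> 		 * multi-index entry covering all of its indices, so tail
> 		 * pages do not get entries of their own.
> 		 */
> 	}
> 	rcu_read_unlock();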

Hmm, I misunderstood the issue above. To clarify: the error comes out
when a page cache folio with minorder is split to order-0, and a NULL
pointer dereference then shows up in the writeback path. I thought the
folio was being split to a non-0 order. split_huge_page_to_list_to_order()
should be fine, since splitting to order-0 is unchanged by my patches.

I wonder if you can isolate the issue by just splitting a dirty minorder
page cache folio, instead of having folio splits and migration going on
at the same time. You can probably use the split_huge_pages debugfs
interface to do that. Depending on the result, we can narrow down the
cause of the issue.
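
For example (a hypothetical invocation; the file path is made up, and
the exact offset semantics are documented in
Documentation/admin-guide/mm/transhuge.rst), a small userspace helper
could ask the split_huge_pages debugfs knob to split the file-backed
folios in a range down to order-0:

/* Assumes debugfs is mounted at /sys/kernel/debug and a v6.9+ kernel,
 * where split_huge_pages accepts an optional trailing new_order.
 * Equivalent to:
 *   echo "/mnt/xfs/testfile,0x0,0x10000,0" > \
 *       /sys/kernel/debug/split_huge_pages
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

	if (!f) {
		perror("split_huge_pages");
		return 1;
	}
	/* <path>,<off_start>,<off_end>[,<new_order>] */
	fprintf(f, "/mnt/xfs/testfile,0x0,0x10000,0\n");
	return fclose(f) ? 1 : 0;
}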


--
Best Regards,
Yan, Zi
