Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement

Luis Chamberlain <mcgrof@xxxxxxxxxx> · Mon, 29 Apr 2024 17:31:04 -0700

On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
> 
> > On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
> >> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> >>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> >>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> >>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> >>>>>> From: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
> >>>>>>
> >>>>>> using that API for LBS is resulting in a NULL ptr dereference
> >>>>>> error in the writeback path [1].
> >>>>>>
> >>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> >>>>>
> >>>>>  How would I go about reproducing this?
> >>
> >> Well so the below fixes this but I am not sure if this is correct.
> >> folio_mark_dirty() at least says that a folio should not be truncated
> >> while it's running. I am not sure if we should try to split folios then
> >> even though we check for writeback once. truncate_inode_partial_folio()
> >> will folio_wait_writeback() but it will split_folio() before checking
> >> for claiming to fail to truncate with folio_test_dirty(). But since the
> >> folio is locked it's not clear why this should be possible.
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 83955362d41c..90195506211a 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >>  	if (new_order >= folio_order(folio))
> >>  		return -EINVAL;
> >>
> >> -	if (folio_test_writeback(folio))
> >> +	if (folio_test_dirty(folio) || folio_test_writeback(folio))
> >>  		return -EBUSY;
> >>
> >>  	if (!folio_test_anon(folio)) {
> >
> > I wondered what code path is causing this and triggering this null
> > pointer, so I just sprinkled a check here:
> >
> > 	VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
> >
> > The answer was:
> >
> > kcompactd() --> migrate_pages_batch()
> >                   --> try_split_folio --> split_folio_to_list() -->
> > 		       split_huge_page_to_list_to_order()
> >
> 
> There are 3 try_split_folio() in migrate_pages_batch().

This is only true for linux-next; v6.9-rc5, on which this testing is
based, has only two.

> First one is to split anonymous large folios that are on deferred
> split list, so not related;

This is in linux-next and not v6.9-rc5.

> second one is to split THPs when thp migration is not supported, but
> this is compaction, so not related; third one is to split large folios
> when there is no same size free page in the system, and this should be
> the one.

Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This
also helps us enhance the reproducer further, which I'll do next.

> > And I verified that moving the check only to the migrate_pages_batch()
> > path also fixes the crash:
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 73a052a382f1..83b528eb7100 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> >  	int rc;
> >
> >  	folio_lock(folio);
> > +	if (folio_test_dirty(folio)) {
> > +		rc = -EBUSY;
> > +		goto out;
> > +	}
> >  	rc = split_folio_to_list(folio, split_folios);
> > +out:
> >  	folio_unlock(folio);
> >  	if (!rc)
> >  		list_move_tail(&folio->lru, split_folios);
> >
> > However I'd like compaction folks to review this. I see some indications
> > in the code that migration can race with truncation but we feel fine by
> > it by taking the folio lock. However here we have a case where we see
> > the folio clearly locked and the folio is dirty. Other migration code
> > seems to write back the code and can wait, here we just move on. Further
> > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> > as completely unmovable") seems to hint that migration is safe if the
> > mapping either does not exist or the mapping does exist but has
> > mapping->a_ops->migrate_folio so I'd like further feedback on this.
> 
> During migration, all page table entries pointing to this dirty folio
> are invalid, and accesses to this folio will cause page fault and
> wait on the migration entry. I am not sure we need to skip dirty folios.

I see.. thanks!

> > Another thing which requires review is if we split a folio but not
> > down to order 0 but to the new min order, does the accounting on
> > migrate_pages_batch() require changing?  And most puzzling, why do we
> 
> What accounting are you referring to? split code should take care of it.

The folio order can change after a split, so I was concerned about the
nr_pages used in migrate_pages_batch(). But I see now that when
migrate_folio_unmap() first fails we try to split the folio, and if that
succeeds the caller will call migrate_pages_batch() again, with a retry
attempt of 1, on only the split folios. I also see that nr_pages is just
local to each list for each loop: first on the from list to unmap, and
afterwards on the unmap list to move the folios.
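
To make the accounting point concrete, here is a toy userspace model (not
kernel code, and not how migrate_pages_batch() is actually written). The
model_* helpers are hypothetical names used only for illustration. It just
shows that splitting to a non-zero min order conserves the page count, so
a per-folio nr_pages recomputed after the split still adds up:

```c
#include <assert.h>

/* Pages in a folio of the given order: 1 << order. */
static inline unsigned long model_nr_pages(unsigned int order)
{
	return 1UL << order;
}

/*
 * Number of folios produced when a folio of old_order is split down to
 * new_order. Hypothetical helper for illustration only.
 */
static inline unsigned long model_split_count(unsigned int old_order,
					      unsigned int new_order)
{
	/* The split path rejects new_order >= folio order with -EINVAL. */
	assert(new_order < old_order);
	return 1UL << (old_order - new_order);
}
```

For example, an order-4 folio split to a min order of 2 gives 4 folios of
4 pages each, the same 16 pages as before the split.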

> > not see this with regular large folios, but we do see it with minorder ?
> 
> I wonder if the split code handles folio->mapping->i_pages properly.
> Does the i_pages store just folio pointers or also need all tail page
> pointers? I am no expert in fs, thus need help.

mapping->i_pages stores folio pointers in the page cache or
swap/dax/shadow entries (xa_is_value(folio)). The folios however can be
special and we special-case them with shmem_mapping(mapping) checks.
split_huge_page_to_list_to_order() doesn't get called with swap/dax/shadow
entries, and we also bail out on shmem_mapping(mapping) already.

  Luis