When we have large pages in the page cache, we can end up in truncate_inode_pages_range() with an 'lstart' that is in the middle of a tail page. My approach has generally been to split the large page, and that works except when split_huge_page() fails, which it can do at random because a racing access has the page refcount elevated. I've been simulating split_huge_page() failures, and I've found a problem I don't know how to solve.

truncate_inode_pages_range() is called by COLLAPSE_RANGE in order to evict the part of the page cache after the start of the range being collapsed (any page cache left behind would now contain data for the wrong part of the file). xfs_flush_unmap_range() (and, I presume, the other filesystems which support COLLAPSE_RANGE) calls filemap_write_and_wait_range() first, so we can just drop the partial large page if the split doesn't succeed.

But truncate_inode_pages_range() is also called by, for example, truncate(). In that case, nobody calls filemap_write_and_wait_range(), so we can't discard the page: it might still be dirty.

Is that an acceptable way to choose behaviour -- if the split fails, discard the page if it's clean and keep it if it's dirty? I'll put a great big comment on it, because it's not entirely obvious.
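
To make the proposal concrete, here's a minimal sketch of the decision I have in mind. This is not a real patch: truncate_partial_thp() is a hypothetical helper that truncate_inode_pages_range() would call when 'lstart' lands inside a large page, with the page locked and a reference held. Only split_huge_page(), PageDirty(), and delete_from_page_cache() are real kernel interfaces here.

	#include <linux/mm.h>
	#include <linux/huge_mm.h>
	#include <linux/pagemap.h>

	/*
	 * Hypothetical helper, called with the page locked and a
	 * reference held.  Returns true if the page was disposed of,
	 * false if it must be kept in the page cache.
	 */
	static bool truncate_partial_thp(struct address_space *mapping,
					 struct page *page)
	{
		/* The happy path: split to base pages; the caller can
		 * then truncate the subpages past 'lstart' as usual. */
		if (split_huge_page(page) == 0)
			return true;

		/*
		 * Split failed, most likely because a racing access
		 * holds an extra refcount.  COLLAPSE_RANGE callers
		 * (e.g. via xfs_flush_unmap_range()) have already done
		 * filemap_write_and_wait_range(), so a clean page holds
		 * no data we could lose; drop the whole thing.
		 * truncate() callers make no such call, so a dirty page
		 * may hold the only copy of the data and must be kept.
		 */
		if (!PageDirty(page)) {
			delete_from_page_cache(page);
			return true;
		}
		return false;
	}

The point of keying off PageDirty() rather than off the caller is that truncate_inode_pages_range() doesn't know who called it; dirtiness is the property that actually distinguishes the safe case from the unsafe one.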