Re: Splitting a THP beyond EOF

"Chris Mason" <clm@xxxxxx> · Tue, 20 Oct 2020 10:32:59 -0400

On 19 Oct 2020, at 21:43, Matthew Wilcox wrote:

This is a weird one ... which is good because it means the obvious
ones have been fixed and now I'm just tripping over the weird cases.
And fortunately, xfstests exercises the weird cases.

1. The file is 0x3d000 bytes long.
2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
3. We simulate a read error for 0x3c000-0x3cfff
4. Userspace writes to 0x3d697 to 0x3dfaa
5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
   so it calls iomap_split_page() (passing page 0x3d)
6. iomap_split_page() calls split_huge_page()
7. split_huge_page() sees that page 0x3d is beyond EOF, so it removes 
it
   from i_pages
8. iomap_write_actor() copies the data into page 0x3d

I’m guessing that iomap_write_begin() is still in charge of locking 
the pages, and that iomap_split_page()->split_huge_page() is just 
reusing that lock?

It sounds like you’re missing a flag to iomap_split_page() that says: 
I care about range A->B, even if its beyond EOF.  IOW, 
iomap_write_begin()’s path should be in charge of doing the right 
thing for the write, without relying on the rest of the kernel to avoid 
upsetting it.

9. The write is lost.

Trying to persuade XFS to update i_size before calling
iomap_file_buffered_write() seems like a bad idea.

Changing split_huge_page() to disregard i_size() is something I kind
of want to be able to do long-term in order to make hole-punch more
efficient, but that seems like a lot of work right now.

The problem with trusting i_size is that it changes at surprising times. 
 For this code inside split_huge_page(), end == i_size_read()

        for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
                __split_huge_page_tail(head, i, lruvec, list);
                /* Some pages can be beyond i_size: drop them from page 
cache */
                if (head[i].index >= end) {
                        ClearPageDirty(head + i);

But, we actually change i_size after dropping all the page locks.  In 
xfs this is xfs_setattr_size()->truncate_setsize(), all of which means 
that dropping PageDirty seems unwise if this code is running 
concurrently with an expanding truncate.  If i_size jumps past the page 
where you’re clearing dirty, it probably won’t be good.  Ignore me 
if this is already handled differently, it just seems error prone in 
current Linus.

-chris