On 19 Oct 2020, at 21:43, Matthew Wilcox wrote:
This is a weird one ... which is good because it means the obvious
ones have been fixed and now I'm just tripping over the weird cases.
And fortunately, xfstests exercises the weird cases.
1. The file is 0x3d000 bytes long.
2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
3. We simulate a read error for 0x3c000-0x3cfff
4. Userspace writes to 0x3d697 to 0x3dfaa
5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
so it calls iomap_split_page() (passing page 0x3d)
6. iomap_split_page() calls split_huge_page()
7. split_huge_page() sees that page 0x3d is beyond EOF, so it removes
it
from i_pages
8. iomap_write_actor() copies the data into page 0x3d
I’m guessing that iomap_write_begin() is still in charge of locking
the pages, and that iomap_split_page()->split_huge_page() is just
reusing that lock?
It sounds like you’re missing a flag to iomap_split_page() that says:
I care about range A->B, even if its beyond EOF. IOW,
iomap_write_begin()’s path should be in charge of doing the right
thing for the write, without relying on the rest of the kernel to avoid
upsetting it.
9. The write is lost.
Trying to persuade XFS to update i_size before calling
iomap_file_buffered_write() seems like a bad idea.
Changing split_huge_page() to disregard i_size() is something I kind
of want to be able to do long-term in order to make hole-punch more
efficient, but that seems like a lot of work right now.
The problem with trusting i_size is that it changes at surprising times.
For this code inside split_huge_page(), end == i_size_read()
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
/* Some pages can be beyond i_size: drop them from page
cache */
if (head[i].index >= end) {
ClearPageDirty(head + i);
But, we actually change i_size after dropping all the page locks. In
xfs this is xfs_setattr_size()->truncate_setsize(), all of which means
that dropping PageDirty seems unwise if this code is running
concurrently with an expanding truncate. If i_size jumps past the page
where you’re clearing dirty, it probably won’t be good. Ignore me
if this is already handled differently, it just seems error prone in
current Linus.
-chris