On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:
> This is a weird one ... which is good because it means the obvious
> ones have been fixed and now I'm just tripping over the weird cases.
> And fortunately, xfstests exercises the weird cases.
>
> 1. The file is 0x3d000 bytes long.
> 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> 3. We simulate a read error for 0x3c000-0x3cfff
> 4. Userspace writes to 0x3d697 to 0x3dfaa

So this is a write() beyond EOF, yes? If yes, then we first go through
this path:

  xfs_file_buffered_aio_write()
    xfs_file_aio_write_checks()
      iomap_zero_range(isize, pos - isize)

to zero the region between the current EOF and where the new write
starts, i.e. from 0x3d000 to 0x3d696.

> 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
>    so it calls iomap_split_page() (passing page 0x3d)

Splitting the page because it's !Uptodate seems rather drastic to me.
Why does it need to split the page here?

Also, this concerns me: if we are exposing the cached EOF page via
mmap, it needs to contain only zeroes in the region beyond EOF so that
we don't expose stale data to userspace. Hence when a THP that
contains EOF is instantiated, we have to ensure that the region beyond
EOF is completely zeroed.

It then follows that if we read all the data in that THP up to EOF,
then the page is actually up to date... And hence I don't see why we'd
need to split it just because we got a small, unaligned write beyond
EOF....

> 6. iomap_split_page() calls split_huge_page()
> 7. split_huge_page() sees that page 0x3d is beyond EOF, so it removes it
>    from i_pages
> 8. iomap_write_actor() copies the data into page 0x3d
> 9. The write is lost.
>
> Trying to persuade XFS to update i_size before calling
> iomap_file_buffered_write() seems like a bad idea.

Agreed, I can't see anything good coming from doing that...
> Changing split_huge_page() to disregard i_size() is something I kind
> of want to be able to do long-term in order to make hole-punch more
> efficient, but that seems like a lot of work right now.
>
> I think the easiest way to fix this is to decline to allocate readahead
> pages beyond EOF. That is, if we have a file which is, say, 61 pages
> long, read the last 5 pages into an order-2 THP and an order-0 THP
> instead of allocating an order-3 THP and zeroing the last three pages.

I think you need to consider zeroing the THP range beyond EOF when
doing readahead....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx