On Tue, Oct 20, 2020 at 03:59:28PM +1100, Dave Chinner wrote: > On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote: > > This is a weird one ... which is good because it means the obvious > > ones have been fixed and now I'm just tripping over the weird cases. > > And fortunately, xfstests exercises the weird cases. > > > > 1. The file is 0x3d000 bytes long. > > 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff > > 3. We simulate a read error for 0x3c000-0x3cfff > > 4. Userspace writes to 0x3d697 to 0x3dfaa > > So this is a write() beyond EOF, yes? > > If yes, then we first go through this path: > > xfs_file_buffered_aio_write() > xfs_file_aio_write_checks() > iomap_zero_range(isize, pos - isize) > > To zero the region between the current EOF and where the new write > starts. i.e. from 0x3d000 to 0x3d696. Yes. That calls iomap_write_begin() which calls iomap_split_page() which is where we run into trouble. I elided the exact path from the description of the problem. > > 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate > > so it calls iomap_split_page() (passing page 0x3d) > > Splitting the page because it's !Uptodate seems rather drastic to > me. Why does it need to split the page here? Because we can't handle Dirty, !Uptodate THPs in the truncate path. Previous discussion: https://lore.kernel.org/linux-mm/20200821144021.GV17456@xxxxxxxxxxxxxxxxxxxx/ The current assumption is that a !Uptodate THP is due to a read error, and so the sensible thing to do is split it and handle read errors at a single-page level. I've been playing around with creating THPs in the write path, and that offers a different pathway to creating Dirty, !Uptodate THPs, so this may also change at some point. I'd like to get what I have merged and then figure out how to make this better. > Also, this concerns me: if we are exposing the cached EOF page via > mmap, it needs to contain only zeroes in the region beyond EOF so > that we don't expose stale data to userspace. Hence when a THP that > contains EOF is instantiated, we have to ensure that the region > beyond EOF is compeltely zeroed. It then follows that if we read all > the data in that THP up to EOF, then the page is actually up to > date... We do that in iomap_readpage_actor(). Had the readahead I/O not "failed", we'd've had an Uptodate THP which straddled EOF. I dumped the page and its uptodate bits are 0xfff0 (1kB block size filesystem, the three pages 0x3d-0x3f are uptodate because they were zeroed and 0x3c is !uptodate because the I/O failed).