Re: Splitting a THP beyond EOF

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Tue, 20 Oct 2020 12:21:38 +0100

On Tue, Oct 20, 2020 at 03:59:28PM +1100, Dave Chinner wrote:
> On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:
> > This is a weird one ... which is good because it means the obvious
> > ones have been fixed and now I'm just tripping over the weird cases.
> > And fortunately, xfstests exercises the weird cases.
> > 
> > 1. The file is 0x3d000 bytes long.
> > 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> > 3. We simulate a read error for 0x3c000-0x3cfff
> > 4. Userspace writes to 0x3d697 to 0x3dfaa
> 
> So this is a write() beyond EOF, yes?
> 
> If yes, then we first go through this path:
> 
> 	xfs_file_buffered_aio_write()
> 	  xfs_file_aio_write_checks()
> 	    iomap_zero_range(isize, pos - isize)
> 
> To zero the region between the current EOF and where the new write
> starts. i.e. from 0x3d000 to 0x3d696.

Yes.  That calls iomap_write_begin() which calls iomap_split_page()
which is where we run into trouble.  I elided the exact path from the
description of the problem.

> > 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
> >    so it calls iomap_split_page() (passing page 0x3d)
> 
> Splitting the page because it's !Uptodate seems rather drastic to
> me.  Why does it need to split the page here?

Because we can't handle Dirty, !Uptodate THPs in the truncate path.
Previous discussion:
https://lore.kernel.org/linux-mm/20200821144021.GV17456@xxxxxxxxxxxxxxxxxxxx/

The current assumption is that a !Uptodate THP is due to a read error,
and so the sensible thing to do is split it and handle read errors at
a single-page level.

I've been playing around with creating THPs in the write path, and that
offers a different pathway to creating Dirty, !Uptodate THPs, so this
may also change at some point.  I'd like to get what I have merged and
then figure out how to make this better.

> Also, this concerns me: if we are exposing the cached EOF page via
> mmap, it needs to contain only zeroes in the region beyond EOF so
> that we don't expose stale data to userspace. Hence when a THP that
> contains EOF is instantiated, we have to ensure that the region
> beyond EOF is compeltely zeroed. It then follows that if we read all
> the data in that THP up to EOF, then the page is actually up to
> date...

We do that in iomap_readpage_actor().  Had the readahead I/O not "failed",
we'd've had an Uptodate THP which straddled EOF.  I dumped the page and
its uptodate bits are 0xfff0 (1kB block size filesystem, the three pages
0x3d-0x3f are uptodate because they were zeroed and 0x3c is !uptodate
because the I/O failed).