I have only bad ideas for solving this problem, so I thought I'd share.

Operations like holepunch or truncate call into truncate_inode_pages_range(), which just removes THPs that are entirely within the punched range, but pages which contain one or both ends of the range need to be split. What I have right now, and works, calls do_invalidatepage() from truncate_inode_pages_range(). The iomap implementation just calls iomap_page_release(). We have the page locked, and we've waited for writeback to finish, so there is no I/O outstanding against the page. We may then get into one of the writepage methods without an iomap being allocated, so we have to allocate it. Patches here:

http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601
http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5

If the page is Uptodate, this works great. But if it's not, then we're going to unnecessarily re-read data from storage -- indeed, we may as well just dump the entire page if it's !Uptodate. Of course, right now the only way to get THPs is through readahead, and that always reads the entire page, so we're never going to see a !Uptodate THP. But in the future, maybe we shall? I wouldn't like to rely on that without pasting some big warnings for anyone who tries to change it.

Alternative 1: Allocate a new iop for each base page if needed. This is the obvious approach (a rough sketch is at the end of this mail). If the block size is >= PAGE_SIZE, we don't even need to allocate a new iop; we can just mark the page as Uptodate if that range is. The problem is that we might need to allocate a lot of them -- 512 if we're splitting a 2MB page into 4kB chunks (which would be 12kB of iops, plus a 4kB array to hold 512 pointers). And at the point where we know we need to allocate them, we're under a spin_lock_irq(). We could allocate them in advance, but then we might find out we aren't going to split this page after all.

Alternative 2: Always allocate an iop for each base page in a THP. We pay the allocation price up front for every THP, even though the majority will never be split. It does save us from allocating any iop at all for block size >= page size, but it's a lot of extra overhead to gather all the Uptodate bits. As above, it'd be 12kB of iops vs the 80 bytes we currently allocate for a 2MB THP (144 bytes once we also track dirty bits).

Alternative 3: Allow pages to share an iop. Do something like add a pgoff_t base and a refcount_t to the iop and have each base page point to the same iop, using the part of the bit array indexed by (page->index - base) * blocks_per_page (again, sketch below). The read/write counts are harder to share, and I'm not sure I see a way to do it.

Alternative 4: Don't support partially-uptodate THPs. We declare (and enforce with strategic assertions; last sketch below) that all THPs must always be Uptodate (or Error) once unlocked. If your workload benefits from using THPs, you want to do big I/Os anyway, so these "write 512 bytes at a time using O_SYNC" workloads aren't going to use THPs.

Funnily, buffer_heads are easier here. They just need to be reparented to their new page. Of course, they're allocated up front, so they're essentially alternative 2.
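
For reference, here's roughly what Alternative 1's per-page copy could look like. This is only a sketch, not code from the patches above; it assumes the iop keeps approximately its current layout (read/write counts, a lock, and a flexible-array uptodate bitmap), and iop_split_one() is a name I made up. It also ignores the real problem, which is that we'd be under spin_lock_irq() when we discover we need these allocations:

struct iomap_page {
	atomic_t	read_count;
	atomic_t	write_count;
	spinlock_t	uptodate_lock;
	unsigned long	uptodate[];
};

/* Hypothetical helper: build the iop for base page 'n' of a THP. */
static struct iomap_page *iop_split_one(struct iomap_page *thp_iop,
		unsigned int n, unsigned int blocks_per_page, gfp_t gfp)
{
	struct iomap_page *iop;
	unsigned int b;

	iop = kzalloc(struct_size(iop, uptodate,
			BITS_TO_LONGS(blocks_per_page)), gfp);
	if (!iop)
		return NULL;
	spin_lock_init(&iop->uptodate_lock);

	/* Copy this page's slice of the THP's uptodate bits. */
	for (b = 0; b < blocks_per_page; b++)
		if (test_bit(n * blocks_per_page + b, thp_iop->uptodate))
			set_bit(b, iop->uptodate);
	return iop;
}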
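
And a sketch of Alternative 3's shared iop -- the 'base' and 'ref' fields are names invented for illustration. This makes the uptodate bits shareable, but as noted above, I don't see what to do about the read/write counts:

struct iomap_page {
	pgoff_t		base;		/* index of first page sharing this iop */
	refcount_t	ref;		/* one reference per base page */
	atomic_t	read_count;	/* how to share these two? */
	atomic_t	write_count;
	spinlock_t	uptodate_lock;
	unsigned long	uptodate[];
};

/* Hypothetical: first bit in the shared bitmap for this base page. */
static unsigned int iop_first_block(struct page *page,
		struct iomap_page *iop, unsigned int blocks_per_page)
{
	return (page->index - iop->base) * blocks_per_page;
}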
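
Finally, the kind of strategic assertion Alternative 4 would lean on, e.g. on the unlock path (again, just a sketch):

/* A THP must be Uptodate (or Error) by the time it's unlocked. */
static inline void assert_thp_uptodate(struct page *page)
{
	VM_BUG_ON_PAGE(PageTransHuge(page) && !PageUptodate(page) &&
			!PageError(page), page);
}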