On Fri, Aug 21, 2020 at 03:40:21PM +0100, Matthew Wilcox wrote:
> I have only bad ideas for solving this problem, so I thought I'd share.
>
> Operations like holepunch or truncate call into
> truncate_inode_pages_range(), which just removes THPs that are
> entirely within the punched range, but pages which contain one or both
> ends of the range need to be split.
>
> What I have right now, and works, calls do_invalidatepage() from
> truncate_inode_pages_range().  The iomap implementation just calls
> iomap_page_release().  We have the page locked, and we've waited for
> writeback to finish, so there is no I/O outstanding against the page.
> We may then get into one of the writepage methods without an iomap being
> allocated, so we have to allocate it.  Patches here:
>
> http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601
> http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5
>
> If the page is Uptodate, this works great.  But if it's not, then we're
> going to unnecessarily re-read data from storage -- indeed, we may as
> well just dump the entire page if it's !Uptodate.  Of course, right now
> the only way to get THPs is through readahead, and that's always going
> to read the entire page, so we're never going to see a !Uptodate THP.
> But in the future, maybe we shall?  I wouldn't like to rely on that
> without pasting some big warnings for anyone who tries to change it.
>
> Alternative 1: Allocate a new iop for each base page if needed.  This is
> the obvious approach.  If the block size is >= PAGE_SIZE, we don't even
> need to allocate a new iop; we can just mark the page as Uptodate if
> that range is.  The problem is that we might need to allocate a lot of
> them -- 512 if we're splitting a 2MB page into 4kB chunks (which would
> be 12kB -- plus a 2kB array to hold 512 pointers).  And at the point
> where we know we need to allocate them, we're under a spin_lock_irq().
> We could allocate them in advance, but then we might find out we aren't
> going to split this page after all.
>
> Alternative 2: Always allocate an iop for each base page in a THP.  We
> pay the allocation price up front for every THP, even though the
> majority will never be split.  It does save us from allocating any iop
> at all for block size >= page size, but it's a lot of extra overhead to
> gather all the Uptodate bits.  As above, it'd be 12kB of iops vs the 80
> bytes that we currently allocate for a 2MB THP -- 144 once we also
> track dirty bits.
>
> Alternative 3: Allow pages to share an iop.  Do something like add a
> pgoff_t base and a refcount_t to the iop and have each base page point
> to the same iop, using the part of the bit array indexed by
> (page->index - base) * blocks_per_page.  The read/write counts are
> harder to share, and I'm not sure I see a way to do it.
>
> Alternative 4: Don't support partially-uptodate THPs.  We declare (and
> enforce with strategic assertions) that all THPs must always be Uptodate
> (or Error) once unlocked.  If your workload benefits from using THPs,
> you want to do big I/Os anyway, so these "write 512 bytes at a time
> using O_SYNC" workloads aren't going to use THPs.
>
> Funnily, buffer_heads are easier here.  They just need to be reparented
> to their new page.  Of course, they're allocated up front, so they're
> essentially alternative 2.

At least initially I'd go for 4, and then see if someone screams loudly
enough to reconsider.  And if we really have to, I wonder if we can do a
variation of alternative 1 where we avoid allocating under the
irqs-disabled spinlock with a clever retry trick.
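The retry trick could look something like the pattern below.  This is a
userspace model, not the real kernel code: a plain pthread mutex stands in
for the irq-disabled spinlock, and the names (struct iomap_page,
struct page_state, get_or_alloc_iop) are all made up for illustration.
The idea is to drop the lock, allocate with blocking allocations allowed,
retake the lock, and recheck whether we still need the iop or lost the
race to install one:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical sketch of "allocate outside the lock, then retry" for a
 * variation of alternative 1.  Not the kernel API. */
struct iomap_page { unsigned long uptodate; };

struct page_state {
	pthread_mutex_t lock;		/* stands in for spin_lock_irq() */
	struct iomap_page *iop;
};

static struct iomap_page *get_or_alloc_iop(struct page_state *ps)
{
	struct iomap_page *new_iop = NULL;

	for (;;) {
		pthread_mutex_lock(&ps->lock);
		if (ps->iop) {
			/* Someone beat us to it; drop our speculative copy. */
			pthread_mutex_unlock(&ps->lock);
			free(new_iop);
			return ps->iop;
		}
		if (new_iop) {
			/* Install the iop we allocated while unlocked. */
			ps->iop = new_iop;
			pthread_mutex_unlock(&ps->lock);
			return new_iop;
		}
		/* Can't block on allocation with the lock held:
		 * drop it, allocate, and retry the whole check. */
		pthread_mutex_unlock(&ps->lock);
		new_iop = calloc(1, sizeof(*new_iop));
		assert(new_iop);
	}
}
```

Because the lock is dropped between the check and the allocation, another
thread may install an iop in the window, so the speculative allocation has
to be freed on that path rather than installed blindly.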
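As an aside on alternative 3, the bit-array indexing quoted above is
simple enough to sketch directly.  shared_iop_bit and its parameter names
are invented for illustration; "base" is the pgoff_t of the first base
page covered by the shared iop:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only, not kernel code: which bit of a shared iop's
 * uptodate array covers a given block of a given base page? */
static size_t shared_iop_bit(size_t page_index, size_t base,
			     size_t blocks_per_page, size_t block_in_page)
{
	/* Each base page owns blocks_per_page consecutive bits, offset
	 * by its distance from the first page sharing the iop. */
	return (page_index - base) * blocks_per_page + block_in_page;
}
```

For example, with 1kB blocks on 4kB base pages (blocks_per_page = 4), the
third block (block_in_page = 2) of the page three pages past base lands
on bit 3 * 4 + 2 = 14.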