On Sat, Aug 22, 2020 at 07:06:18AM +0100, Christoph Hellwig wrote: > On Fri, Aug 21, 2020 at 03:40:21PM +0100, Matthew Wilcox wrote: > > I have only bad ideas for solving this problem, so I thought I'd share. > > > > Operations like holepunch or truncate call into > > truncate_inode_pages_range() which just remove THPs which are > > entirely within the punched range, but pages which contain one or both > > ends of the range need to be split. > > > > What I have right now, and works, calls do_invalidatepage() from > > truncate_inode_pages_range(). The iomap implementation just calls > > iomap_page_release(). We have the page locked, and we've waited for > > writeback to finish, so there is no I/O outstanding against the page. > > We may then get into one of the writepage methods without an iomap being > > allocated, so we have to allocate it. Patches here: > > > > http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601 > > http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5 > > > > If the page is Uptodate, this works great. But if it's not, then we're > > going to unnecessarily re-read data from storage -- indeed, we may as > > well just dump the entire page if it's !Uptodate. Of course, right now > > the only way to get THPs is through readahead and that's going to always > > read the entire page, so we're never going to see a !Uptodate THP. But > > in the future, maybe we shall? I wouldn't like to rely on that without > > pasting some big warnings for anyone who tries to change it. > > > > Alternative 1: Allocate a new iop for each base page if needed. This is > > the obvious approach. If the block size is >= PAGE_SIZE, we don't even > > need to allocate a new iop; we can just mark the page as being Uptodate > > if that range is. The problem is that we might need to allocate a lot of > > them -- 512 if we're splitting a 2MB page into 4kB chunks (which would > > be 12kB -- plus a 2kB array to hold 512 pointers). And at the point > > where we know we need to allocate them, we're under a spin_lock_irq(). > > We could allocate them in advance, but then we might find out we aren't > > going to split this page after all. > > > > Alternative 2: Always allocate an iop for each base page in a THP. We pay > > the allocation price up front for every THP, even though the majority > > will never be split. It does save us from allocating any iop at all for > > block size >= page size, but it's a lot of extra overhead to gather all > > the Uptodate bits. As above, it'd be 12kB of iops vs 80 bytes that we > > currently allocate for a 2MB THP. 144 once we also track dirty bits. > > > > Alternative 3: Allow pages to share an iop. Do something like add a > > pgoff_t base and a refcount_t to the iop and have each base page point > > to the same iop, using the part of the bit array indexed by (page->index > > - base) * blocks_per_page. The read/write count are harder to share, > > and I'm not sure I see a way to do it. > > > > Alternative 4: Don't support partially-uptodate THPs. We declare (and > > enforce with strategic assertions) that all THPs must always be Uptodate > > (or Error) once unlocked. If your workload benefits from using THPs, > > you want to do big I/Os anyway, so these "write 512 bytes at a time > > using O_SYNC" workloads aren't going to use THPs. > > > > Funnily, buffer_heads are easier here. They just need to be reparented > > to their new page. Of course, they're allocated up front, so they're > > essentially alternative 2. > > At least initially I'd go for 4. And then see if someone screams loudly > enough to reconsider. And if we really have to I wonder if we can do > a variation of variant 1 where we avoid allocating under the irqs > disabled spinlock by a clever retry trick. /me doesn't have any objection to #4, and bets #1 and #3 will probably lead to weird problems /somewhere/ ;) --D