On Thu, Oct 19, 2023 at 06:55:46PM +0200, Jan Kara wrote:
> On Fri 29-09-23 21:45:29, Ojaswin Mujoo wrote:
> > On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote:
> > > In ext4_zero_range() and ext4_punch_hole(), the range passed could be
> > > unaligned, however we only zero out the pagecache range that is block
> > > aligned. These functions are relying on ext4_zero_partial_blocks() ->
> > > __ext4_block_zero_page_range() to take care of zeroing the unaligned
> > > edges in the pagecache. However, the right thing to do is to properly
> > > zero out the whole range in these functions themselves and not rely on
> > > a different function to do it for us. Hence, modify ext4_zero_range()
> > > and ext4_punch_hole() to zero the complete range.
> > >
> > > This will also allow us to exit early for unwritten buffer heads in
> > > __ext4_block_zero_page_range(), in an upcoming patch.
> > >
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > > ---
> > >  fs/ext4/extents.c | 17 +++++++++++------
> > >  fs/ext4/inode.c   |  3 +--
> > >  2 files changed, 12 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > > index c79b4c25afc4..2dc681cab6a5 100644
> > > --- a/fs/ext4/extents.c
> > > +++ b/fs/ext4/extents.c
> > > @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> > >
> > >  	/* Zero range excluding the unaligned edges */
> > >  	if (max_blocks > 0) {
> > > -		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
> > > -			  EXT4_EX_NOCACHE);
> > > -
> > >  		/*
> > >  		 * Prevent page faults from reinstantiating pages we have
> > >  		 * released from page cache.
> > > @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> > >  		 * disk in case of crash before zeroing trans is committed.
> > >  		 */
> > >  		if (ext4_should_journal_data(inode)) {
> > > -			ret = filemap_write_and_wait_range(mapping, start, end - 1);
> > > +			ret = filemap_write_and_wait_range(mapping, start,
> > > +							   end - 1);
> >
> > I think this accidentally crept in, will fix it in the next rev.
>
> Yeah, just pull it into patch 1.
>
> > >  			if (ret) {
> > >  				filemap_invalidate_unlock(mapping);
> > >  				goto out_mutex;
> > >  			}
> > >  		}
> > > +	}
> >
> > So the above if (max_blocks) {...} block runs when the range spans
> > multiple blocks, but I think filemap_write_and_wait_range() and
> > ext4_update_disksize_before_punch() should only be called when we are
> > actually spanning multiple pages, since the disksize-not-updating issue
> > and the truncate racing with a checkpoint only happen when a complete
> > page is truncated. Is this understanding correct?
>
> Why do you think the issues apply only to multiple pages? I mean even if a
> single block is dirty in memory, it may be pushing i_disksize or carrying
> journalled data we need to commit.

Hey Jan,

You are right, I was misunderstanding this code a bit, thinking that these
steps would only be needed if the complete folio is absent. Upon rechecking
paths like the writeback path, I can now see that if the blocks up to
i_size are already allocated (e.g. through ext4_zero_range), we won't
actually call mpage_map_and_submit_extent(), which is where the disksize is
updated.
> > > +	/*
> > > +	 * Now truncate the pagecache and zero out the non page aligned
> > > +	 * edges of the range (if any)
> > > +	 */
> > > +	truncate_pagecache_range(inode, offset, offset + len - 1);
> > >
> > > -	/* Now release the pages and zero block aligned part of pages */
> > > -	truncate_pagecache_range(inode, start, end - 1);
> > > +	if (max_blocks > 0) {
> > >  		inode->i_mtime = inode->i_ctime = current_time(inode);
> > >
> > > +		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
> > >  		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
> > >  					     flags);
> > >  		filemap_invalidate_unlock(mapping);
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 6c490f05e2ba..de8ea8430d30 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> > >  		ret = ext4_update_disksize_before_punch(inode, offset, length);
> >
> > In ext4_punch_hole() I see that we call filemap_write_and_wait_range()
> > and only take the inode_lock() later. Doesn't this leave a window for
> > the pages to get dirty again?
>
> There's definitely a race window but I think the call to
> filemap_write_and_wait_range() is a leftover from the past when hole
> punching could race in a nasty way. These days we have invalidate_lock so I
> *think* we can just remove that filemap_write_and_wait_range() call. At
> least I don't see a good reason for it now because the pages are going away
> anyway. But it needs testing :).
>
> > For example, in ext4_zero_range(), we checkpoint using
> > filemap_write_and_wait_range() in the data=journal case under
> > inode_lock(), but that's not the case here. Just wondering if this
> > or any other code path might still race here?
>
> Well, that's a bit of a different story, as the comment there explains. And
> yes, invalidate_lock protects us from possible races there.

Ahh okay, got it.
> > >  		if (ret)
> > >  			goto out_dio;
> > > -		truncate_pagecache_range(inode, first_block_offset,
> > > -					 last_block_offset);
> > >  	}
> > > +	truncate_pagecache_range(inode, offset, offset + length - 1);
>
> But I have realized that the changes done in this patch actually don't help
> with changing ext4_zero_partial_blocks(), because as soon as we drop
> invalidate_lock, a page fault can come in and modify the contents of
> partial pages we need zeroed.
>
> So, thinking about this again with a fresh mind, these block vs pagecache
> consistency issues probably aren't worth it and the current code flow is
> good enough. Sorry for misleading you. We might just add a comment to
> __ext4_block_zero_page_range() to explain that buffer_unwritten() buffers
> can get there but should already be zeroed-out and uptodate, and that we
> do need to process them because of the page cache zeroing. What do you
> think?

Oh right, I was not thinking about the mmap path, sorry about that. In that
case I think your point makes sense, let's just leave this as it is for
now. I'll send a v2 with the first patch of the series and also add a
comment as you suggested.

Thanks for the review and for taking the time to answer my questions!

Regards,
ojaswin

> 								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR