On 2024/5/7 7:19, Dave Chinner wrote: > On Mon, May 06, 2024 at 07:44:44PM +0800, Zhang Yi wrote: >> On 2024/5/1 16:33, Dave Chinner wrote: >>> On Wed, May 01, 2024 at 06:11:13PM +1000, Dave Chinner wrote: >>>> On Wed, Apr 10, 2024 at 10:29:38PM +0800, Zhang Yi wrote: >>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx> >>>>> >>>>> Implement buffered write iomap path, use ext4_da_map_blocks() to map >>>>> delalloc extents and add ext4_iomap_get_blocks() to allocate blocks if >>>>> delalloc is disabled or free space is about to run out. >>>>> >>>>> Note that we always allocate unwritten extents for new blocks in the >>>>> iomap write path, this means that the allocation type is no longer >>>>> controlled by the dioread_nolock mount option. After that, we could >>>>> postpone the i_disksize updating to the writeback path, and drop journal >>>>> handle in the buffered dealloc write path completely. >>> ..... >>>>> +/* >>>>> + * Drop the staled delayed allocation range from the write failure, >>>>> + * including both start and end blocks. If not, we could leave a range >>>>> + * of delayed extents covered by a clean folio, it could lead to >>>>> + * inaccurate space reservation. >>>>> + */ >>>>> +static int ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset, >>>>> + loff_t length) >>>>> +{ >>>>> + ext4_es_remove_extent(inode, offset >> inode->i_blkbits, >>>>> + DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb))); >>>>> return 0; >>>>> } >>>>> >>>>> +static int ext4_iomap_buffered_write_end(struct inode *inode, loff_t offset, >>>>> + loff_t length, ssize_t written, >>>>> + unsigned int flags, >>>>> + struct iomap *iomap) >>>>> +{ >>>>> + handle_t *handle; >>>>> + loff_t end; >>>>> + int ret = 0, ret2; >>>>> + >>>>> + /* delalloc */ >>>>> + if (iomap->flags & IOMAP_F_EXT4_DELALLOC) { >>>>> + ret = iomap_file_buffered_write_punch_delalloc(inode, iomap, >>>>> + offset, length, written, ext4_iomap_punch_delalloc); >>>>> + if (ret) >>>>> + ext4_warning(inode->i_sb, >>>>> + "Failed to clean up delalloc for inode %lu, %d", >>>>> + inode->i_ino, ret); >>>>> + return ret; >>>>> + } >>>> >>>> Why are you creating a delalloc extent for the write operation and >>>> then immediately deleting it from the extent tree once the write >>>> operation is done? >>> >>> Ignore this, I mixed up the ext4_iomap_punch_delalloc() code >>> directly above with iomap_file_buffered_write_punch_delalloc(). >>> >>> In hindsight, iomap_file_buffered_write_punch_delalloc() is poorly >>> named, as it is handling a short write situation which requires >>> newly allocated delalloc blocks to be punched. >>> iomap_file_buffered_write_finish() would probably be a better name >>> for it.... >>> >>>> Also, why do you need IOMAP_F_EXT4_DELALLOC? Isn't a delalloc iomap >>>> set up with iomap->type = IOMAP_DELALLOC? Why can't that be used? >>> >>> But this still stands - the first thing >>> iomap_file_buffered_write_punch_delalloc() is: >>> >>> if (iomap->type != IOMAP_DELALLOC) >>> return 0; >>> >> >> Thanks for the suggestion, the delalloc and non-delalloc write paths >> share the same ->iomap_end() now (i.e. ext4_iomap_buffered_write_end()), >> I use the IOMAP_F_EXT4_DELALLOC to identify the write path. > > Again, you don't need that. iomap tracks newly allocated > IOMAP_DELALLOC extents via the IOMAP_F_NEW flag that should be > getting set in the ->iomap_begin() call when it creates a new > delalloc extent. > > Please look at the second check in > iomap_file_buffered_write_punch_delalloc(): > > if (iomap->type != IOMAP_DELALLOC) > return 0; > > /* If we didn't reserve the blocks, we're not allowed to punch them. */ > if (!(iomap->flags & IOMAP_F_NEW)) > return 0; > >> For >> non-delalloc path, If we have allocated more blocks and copied less, we >> should truncate extra blocks that newly allocated by ->iomap_begin(). > > Why? If they were allocated as unwritten, then you can just leave > them there as unwritten extents, same as XFS. Keep in mind that if > we get a short write, it is extremely likely the application is > going to rewrite the remaining data immediately, so if we allocated > blocks they are likely to still be needed, anyway.... > Make sense, we don't need to free the extra blocks beyond EOF since they are unwritten, we can drop this handle for non-delalloc path on ext4 now. >> If we use IOMAP_DELALLOC, we can't tell if the blocks are >> pre-existing or newly allocated, we can't truncate the >> pre-existing blocks, so I have to introduce IOMAP_F_EXT4_DELALLOC. >> But if we split the delalloc and non-delalloc handler, we could >> drop IOMAP_F_EXT4_DELALLOC. > > As per above: IOMAP_F_NEW tells us -exactly- this. > > IOMAP_F_NEW should be set on any newly allocated block - delalloc or > real - because that's the flag that tells the iomap infrastructure > whether zero-around is needed for partial block writes. If ext4 is > not setting this flag on delalloc regions allocated by > ->iomap_begin(), then that's a serious bug. > >> I also checked xfs, IIUC, xfs doesn't free the extra blocks beyond EOF >> in xfs_buffered_write_iomap_end() for non-delalloc case since they will >> be freed by xfs_free_eofblocks in some other inactive paths, like >> xfs_release()/xfs_inactive()/..., is that right? > > XFS doesn't care about real blocks beyond EOF existing - > xfs_free_eofblocks() is an optimistic operation that does not > guarantee that it will remove blocks beyond EOF. Similarly, we don't > care about real blocks within EOF because we alway allocate data > extents as unwritten, so we don't have any stale data exposure > issues to worry about on short writes leaving allocated blocks > behind. > > OTOH, delalloc extents without dirty page cache pages over them > cannot be allowed to exist. Without dirty pages, there is no trigger > to convert those to real extents (i.e. nothing to write back). Hence > the only sane thing that can be done with them on a write error or > short write is remove them in the context where they were created. > > This is the only reason that the > iomap_file_buffered_write_punch_delalloc() exists - it abstracts > this nasty corner case away from filesystems that support delalloc > so they don't have to worry about getting this right. That's whole > point of having delalloc aware infrastructure - individual > filesysetms don't need to handle all these weird corner cases > themselves because the infrastructure takes care of them... > Yeah, thanks for the explanation. The iomap_file_buffered_write_punch_delalloc() is very useful, it find pages that have dirty data still pending in the page cache, punch out all the delalloc blocks beside those blocks. I realized that it is used to fix a race condition between either writeback or mmap page faults that xfs encountered [1]. We will meet the same problem for ext3 and ext2 which are not extent based. Their new allocated blocks were written, we need to free them if we get a short write, but we can't simply do it through ext2_write_failed() and ext4_truncate_failed_write(), we still need to use iomap_file_buffered_write_punch_delalloc(). [1] https://lore.kernel.org/all/20221123055812.747923-6-david@xxxxxxxxxxxxx/ Thanks, Yi.